# Chapter 1

### KNN

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# First convert dataset to numpy since sklearn uses numpy
y = df['target'].values
X = df.drop('target', axis=1).values
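# Split dataset first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Fit the scaler on the training set only, then apply it to both sets (avoids data leakage)
scaler = preprocessing.StandardScaler().fit(X_train.astype(float))
X_train = scaler.transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))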
# Initialize and train model
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski')
knn.fit(X_train, y_train)
# Predict the test set class with the trained model
predicted_y = knn.predict(X_test)
# Predict class probabilities for the test set with the trained model
predicted_y_prob = knn.predict_proba(X_test)
# Measure accuracy on testing set
print(accuracy_score(y_test, predicted_y)*100)
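# Visualize accuracy and its standard error for different values of K
import numpy as np
import matplotlib.pyplot as plt
Ks = 10  # illustrative upper bound on K (assumption, tune as needed)
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for k in range(1, Ks):
    y_hat = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)
    mean_acc[k - 1] = accuracy_score(y_test, y_hat)
    # Standard error of the accuracy estimate on the test set
    std_acc[k - 1] = np.std(y_hat == y_test) / np.sqrt(y_hat.shape[0])
plt.plot(range(1, Ks), mean_acc, 'g')
plt.fill_between(range(1, Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1, Ks), mean_acc - 3 * std_acc, mean_acc + 3 * std_acc, alpha=0.10, color="green")
plt.legend(('Accuracy', '+/- 1xstd', '+/- 3xstd'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
# Plot complexity graph: train and test accuracy for each K
neighbors = range(1, Ks)
train_accuracies = {}
test_accuracies = {}
for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_accuracies[k] = model.score(X_train, y_train)
    test_accuracies[k] = model.score(X_test, y_test)
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.show()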
```
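Instead of reading the best K off the test-set curve, K can also be chosen by cross-validation on the training split. The sketch below is one possible way to do that, not the method of these notes: it uses synthetic data from `make_classification` as a stand-in for `df`, and the step names `"scale"` and `"knn"` are arbitrary. Putting the scaler inside a `Pipeline` means it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": list(range(1, 16))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_test, y_test)  # evaluated once, on data not used for tuning
print(f"best K = {best_k}, test accuracy = {test_acc:.3f}")
```

The test set is touched only once here, after K has been fixed, so the reported accuracy is not biased by the tuning.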

### Logistic Regression

```
import numpy as np
# Specify independent and dependent features
X = np.asarray(df[['A', 'B', 'C', 'D', 'E', 'F', 'G']])
y = np.asarray(df['target'])

# Split into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Preprocess dataset: fit the scaler on the training set only to avoid data leakage
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=0.01, solver='liblinear')
LR.fit(X_train,y_train)

# Predict the test set
y_pred = LR.predict(X_test)

# See classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
# With labels=[1,0], rows/columns are ordered positive class first
print(confusion_matrix(y_test, y_pred, labels=[1,0]))

# Predicted probability on test set for positive/target class
y_pred_prob = LR.predict_proba(X_test)[:, 1]

# Evaluate the model
from sklearn.metrics import jaccard_score
jaccard_score(y_test, y_pred,pos_label=0)

from sklearn.metrics import log_loss
log_loss(y_test, y_pred_prob)

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_prob))
```
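The `labels=[1,0]` argument changes the layout of the confusion matrix, which is easy to misread. The small sketch below uses made-up label vectors (an assumption, not data from these notes) to show how the cells map to TP, FN, FP and TN in that ordering, and checks the hand-computed precision and recall against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 1])

# Rows are true classes, columns are predicted classes, both ordered [1, 0]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the true positives, how many were found

print(cm)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

With the default label order (`[0, 1]`) the same `ravel()` call would instead yield `tn, fp, fn, tp`.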

### SVM

```
# Method 1
import time
from sklearn.svm import LinearSVC
# OR from sklearn.svm import SVC
# Instantiate a scikit-learn SVM model
# To account for class imbalance at fit time, set class_weight='balanced'
# For reproducible output across multiple function calls, set random_state to a fixed integer
svm = LinearSVC(class_weight='balanced', random_state=42, loss="hinge", fit_intercept=False)
# svm = SVC(kernel='linear', probability=True)  # Another way (gamma is ignored by the linear kernel)
# Train a linear Support Vector Machine model using scikit-learn
t0 = time.time()
svm.fit(X_train, y_train)
sklearn_time = time.time() - t0

# Method 2 : Use snapml library
# in contrast to scikit-learn's LinearSVC, Snap ML offers multi-threaded CPU/GPU training of SVMs
from snapml import SupportVectorMachine
snapml_svm_gpu = SupportVectorMachine(class_weight='balanced', random_state=42, use_gpu=True, fit_intercept=False)
snapml_svm_cpu = SupportVectorMachine(class_weight='balanced', random_state=42, n_jobs=4, fit_intercept=False)
t0 = time.time()
model = snapml_svm_cpu.fit(X_train, y_train)
snapml_time = time.time() - t0

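# Predict class labels
y_pred = svm.predict(X_test)

# Get confidence scores (signed distance to the decision boundary, not probabilities)
y_pred_conf = svm.decision_function(X_test)

# Evaluate ROC AUC on the continuous scores rather than on hard labels
from sklearn.metrics import roc_auc_score, hinge_loss
roc_auc_score(y_test, y_pred_conf)

# Evaluate hinge loss
hinge_loss(y_test, y_pred_conf)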
```
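To make the hinge loss less of a black box, the sketch below recomputes it by hand as the mean of `max(0, 1 - y * f(x))` and compares it with `sklearn.metrics.hinge_loss`. It assumes binary labels coded 0/1 (mapped to -1/+1 for the formula) and uses synthetic data in place of the `X_train`/`X_test` from the notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import hinge_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real dataset (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm = LinearSVC(loss="hinge", random_state=42, max_iter=10000).fit(X_train, y_train)
scores = svm.decision_function(X_test)  # signed distances to the decision boundary

# Map labels {0, 1} to {-1, +1}, then average max(0, 1 - y * f(x))
y_signed = np.where(y_test == 1, 1, -1)
manual = np.mean(np.maximum(0.0, 1.0 - y_signed * scores))
library = hinge_loss(y_test, scores)

print(manual, library)
```

A sample contributes zero loss only when it is on the correct side of the boundary with a margin of at least 1; correctly classified points inside the margin are still penalised.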

### Key Terms / Jargon

- decision boundary: the surface separating regions of different predicted classes
- linear classifier: a classifier that learns a linear decision boundary
- linearly separable: a data set whose classes can be perfectly separated by a linear classifier
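
The terms above can be illustrated with two tiny hand-made data sets (both are assumptions made for illustration): one with two well-separated clusters, and the XOR pattern, which no straight line can separate. `LogisticRegression` is used here simply as a convenient linear classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated clusters: linearly separable
X_sep = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_sep = np.array([0, 0, 0, 1, 1, 1])

# XOR pattern: not linearly separable
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([0, 0, 1, 1])

# Weak regularisation (large C) so the fit is driven mostly by the data
acc_sep = LogisticRegression(C=100).fit(X_sep, y_sep).score(X_sep, y_sep)
acc_xor = LogisticRegression(C=100).fit(X_xor, y_xor).score(X_xor, y_xor)

print("separable clusters, training accuracy:", acc_sep)
print("XOR pattern, training accuracy:", acc_xor)
```

On the separable set a linear decision boundary classifies every training point correctly, while on XOR any line gets at most three of the four points right.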


<center><img src="images/01.01.png" style="width: 400px; height: 300px;"/></center>
<center><img src="images/01.02.png" style="width: 400px; height: 300px;"/></center>

### Overfitting, Underfitting, Bias-Variance Trade-off

- Overfitting:
    - The model memorises noise in the training data along with the underlying pattern.
    - The model performs well on the training data but poorly on unseen data.
    - High variance is responsible for this error, because the model also fits the noise.
    - Diagnosis: the cross-validation (held-out) error is much higher than the training error.
    - Possible remedies: decrease model complexity, gather more data.
- Underfitting:
    - The model is too simple to capture the underlying pattern.
    - The model performs poorly on both training and unseen data.
    - The model is not flexible enough to approximate the target values.
    - High bias is responsible for this error.
    - Diagnosis: the training and cross-validation errors are roughly equal, but both are undesirably high.
    - Possible remedies: increase model complexity, gather more features.
- Bias-Variance trade-off:
    - Generalization error = bias^2 + variance + irreducible error (noise)
    - bias = error term that tells how far, on average, the predicted value is from the true value
    - variance = error term that tells how much the predicted value varies across different training sets
    - When model complexity increases, variance increases and bias decreases
    - When model complexity decreases, variance decreases and bias increases
    - The sweet spot is the model complexity that minimises the generalization error
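
The decomposition above can be estimated empirically. The following is a rough sketch under stated assumptions, not a general recipe: the true function is `sin(2πx)`, the noise is Gaussian with standard deviation 0.3, the inputs sit on a fixed grid so that "different training sets" differ only in their noise, and model complexity is the degree of a polynomial fitted with `np.polyfit`. The helper names `true_f` and `bias_variance` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 30)  # fixed design (assumption)
x_test = np.linspace(0.0, 1.0, 50)
noise_sd = 0.3                       # assumed noise level


def true_f(x):
    """Assumed ground-truth function."""
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_sets=200):
    """Estimate bias^2 and variance of a polynomial fit over many noisy training sets."""
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                 # average prediction variance
    return bias_sq, variance


results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    total = bias_sq + variance + noise_sd ** 2
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}, total={total:.4f}")
```

In this setup the straight line (degree 1) is dominated by bias, while the degree-9 polynomial has very little bias but noticeably more variance, which matches the trade-off described in the list.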