# Using scikit-learn's 'make_classification'

The make_classification function generates a random dataset for classification problems. This notebook shows how to use it to create a sample dataset to work with.

The documentation for this function can be found on the [scikit-learn website](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).

Some of the options are straightforward, but others are less so.

- *n_features* sets the total number of feature columns in the dataset.
- *n_informative* sets how many of those features are actually informative. If *n_informative* is less than *n_features*, the resulting dataset will contain features that add no new information; these can be identified through feature selection techniques.
- *n_redundant* sets the number of redundant features, which are generated as random linear combinations of the informative features.
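
To make the difference between informative and redundant features concrete, here is a minimal sketch. It assumes a hypothetical dataset with 5 features (3 informative, 2 redundant) and passes `shuffle=False` so that the columns keep their documented order: informative first, then redundant. The names `X_demo`, `informative`, and `redundant` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Hypothetical example: 5 features, of which 3 are informative and 2 redundant.
# shuffle=False keeps the column order: informative columns, then redundant ones.
X_demo, y_demo = make_classification(n_samples=500,
                                     n_features=5,
                                     n_informative=3,
                                     n_redundant=2,
                                     shuffle=False,
                                     random_state=1)

informative = X_demo[:, :3]
redundant = X_demo[:, 3:]

# Solve informative @ coef ~= redundant by least squares. If the redundant
# columns really are linear combinations, the reconstruction error is ~0.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_error = np.abs(informative @ coef - redundant).max()

print(f"Rank of the feature matrix: {np.linalg.matrix_rank(X_demo)}")
print(f"Largest reconstruction error: {max_error:.2e}")
```

A reconstruction error near zero means the two redundant columns can be rebuilt from the informative ones, so they add nothing new for a model to learn from.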

In [149]:
from sklearn.datasets import make_classification

Let's start with a very well-behaved dataset:

- 3 features
- all features are informative
- there are no redundant features
- 2 target classes
- the target classes are split 50/50

In [150]:

features, target = make_classification(n_samples=3000,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[0.5, 0.5],
                                       random_state=1)

In [151]:
print(f'Feature Matrix:\n {features[:3]}')

Feature Matrix:
 [[-0.02837016 -1.17901771 -1.9924315 ]
 [ 1.48936958 -1.35588181 -1.54431898]
 [ 0.22795969  0.30478455  0.84319136]]


In [152]:
print(f'Target Vector: \n {target[:3]}')

Target Vector: 
 [1 0 0]
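
The 50/50 split requested through *weights* can be checked by counting the labels. The sketch below regenerates the same dataset so that it runs on its own; `counts` is an illustrative name. The counts may not be exactly equal, because make_classification randomly reassigns a small fraction of labels by default (*flip_y* defaults to 0.01).

```python
import numpy as np
from sklearn.datasets import make_classification

# Same parameters as the dataset above.
features, target = make_classification(n_samples=3000,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[0.5, 0.5],
                                       random_state=1)

# np.bincount counts how many samples carry each integer class label.
counts = np.bincount(target)
print(f"Samples per class: {counts}")
print(f"Class proportions: {counts / counts.sum()}")
```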


## K-nearest Neighbors

In [153]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier


In [154]:
X = features
y = target
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)

# Remember: cross_val_score stratifies the folds when cv is an integer and the estimator is a classifier.
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
print(f"Average score: {scores.mean()}")

[0.91694352 0.92026578 0.910299   0.91       0.88       0.90333333
 0.91666667 0.88294314 0.88628763 0.89966555]
Average score: 0.9026404626718074
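
The stratification mentioned in the comment above can be made explicit. As a sketch, assuming the same dataset and model, passing a `StratifiedKFold` splitter should give the same folds, and therefore the same scores, as passing `cv=10`. The names `implicit_scores` and `explicit_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Regenerate the dataset so this cell runs on its own.
X, y = make_classification(n_samples=3000, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=2, weights=[0.5, 0.5],
                           random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)

# Integer cv: scikit-learn chooses the splitter for us.
implicit_scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')

# Explicit splitter: unshuffled stratified 10-fold.
explicit_scores = cross_val_score(knn, X, y, cv=StratifiedKFold(n_splits=10),
                                  scoring='accuracy')

print(f"Same scores: {np.allclose(implicit_scores, explicit_scores)}")
```

Writing the splitter out is mainly useful when you want to change its behaviour, for example by setting `shuffle=True` with a fixed `random_state`.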


## DecisionTreeClassifier

In [155]:

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
# Remember: cross_val_score stratifies the folds when cv is an integer and the estimator is a classifier.
scores = cross_val_score(tree, X, y, cv=10, scoring='accuracy')
print(scores)
print(f"Average score: {scores.mean()}")

[0.85714286 0.90365449 0.88039867 0.85666667 0.85       0.90333333
 0.86       0.85953177 0.8729097  0.88628763]
Average score: 0.8729925110279003


### Summary

For the dataset generated by make_classification, the mean 10-fold cross-validation accuracies are:

- KNN: 0.9026404626718074
- DecisionTreeClassifier: 0.8729925110279003

Let's take a single split from train_test_split and see what the confusion matrix looks like for that split.

In [156]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

#### KNN

In [157]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

knn_confusion = confusion_matrix(y_test, knn_predictions)
knn_score = knn.score(X_test, y_test)

print(f"KNN Confusion Matrix:\n{knn_confusion}")
print(f'{knn_score}')

KNN Confusion Matrix:
[[329  42]
 [ 27 352]]
0.908


#### DecisionTree

In [158]:
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
tree_predictions = tree.predict(X_test)
tree_confusion = confusion_matrix(y_test, tree_predictions)
tree_score = tree.score(X_test, y_test)

print(f"Tree Confusion Matrix:\n{tree_confusion}")
print(f'{tree_score}')

Tree Confusion Matrix:
[[310  61]
 [ 28 351]]
0.8813333333333333


### Summary

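On this single train_test_split, the KNN model still has the higher accuracy (0.908 versus 0.881), and the confusion matrices now show where each model goes wrong. With rows as true classes and columns as predicted classes, and class 1 treated as the positive class, KNN makes 42 false positives and 27 false negatives, while the decision tree makes 61 false positives and 28 false negatives. From here, we have to decide whether the business problem calls for weighting one kind of error more heavily: is it acceptable to have more false negatives, or more false positives?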
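
To turn those counts into metrics that separate the two kinds of error, one option is to unpack the confusion matrix and compute precision and recall by hand. The sketch below hard-codes the KNN matrix printed earlier so that it runs on its own, and it assumes class 1 is the positive class.

```python
import numpy as np

# KNN confusion matrix from the output above (rows: true class, columns: predicted class).
knn_confusion = np.array([[329, 42],
                          [27, 352]])

# For a binary problem, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = knn_confusion.ravel()

accuracy = (tp + tn) / knn_confusion.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"False positives: {fp}, false negatives: {fn}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

If false negatives are the more costly error for the problem at hand, recall is the number to watch; if false positives are, precision is.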


# TODO

- Add another example in which the number of informative features is smaller than the total number of features, and use the techniques from the earlier notebooks to reduce the features.

