# Classification
Classification is the branch of machine learning where the goal is to assign a label to a data entry based on its features. The model learns from known examples, which are called labeled data. Learning from labeled examples is called supervised learning.

Scikit-Learn is a Python package that provides a range of machine learning functions. Models such as decision trees can be used with minimal code.

## Apples and Oranges
Let's use the apples and oranges example to demonstrate how data can be classified.


In [99]:
from sklearn import tree

# features = [[140, "smooth"], [130, "smooth"], [150, "bumpy"], [170, "bumpy"]]
# labels = ["apple", "apple", "orange", "orange"]

# smooth : 1, bumpy : 0
# apple : 0, orange : 1

# Original four samples:
# features = [[140, 1], [130, 1], [150, 0], [170, 0]]
# labels = [0, 0, 1, 1]
# Checking with more samples
features = [[140, 1], [130, 1], [150, 0], [170, 0], [180, 1], [120, 1], [142, 0], [150, 0]]
labels = [0, 1, 1, 1, 1, 1, 1, 0]

classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(features, labels)

print(classifier.predict([[145, 0]]))

[1]
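
The commented-out lists in the cell above hold text values, which `DecisionTreeClassifier` cannot use directly. The sketch below shows one way to apply the smooth/bumpy and apple/orange encoding from the comments and then turn the numeric prediction back into a name. It uses only the four original samples. The dictionary names and the query `[160, "bumpy"]` are illustrative choices, not part of the original example.

```python
from sklearn import tree

# Text data taken from the commented-out lines in the example above
raw_features = [[140, "smooth"], [130, "smooth"], [150, "bumpy"], [170, "bumpy"]]
raw_labels = ["apple", "apple", "orange", "orange"]

# Encoding from the example: smooth : 1, bumpy : 0 and apple : 0, orange : 1
texture_codes = {"smooth": 1, "bumpy": 0}
label_codes = {"apple": 0, "orange": 1}
# Reverse lookup so a numeric prediction can be shown as a name
label_names = {code: name for name, code in label_codes.items()}

features = [[value, texture_codes[texture]] for value, texture in raw_features]
labels = [label_codes[name] for name in raw_labels]

# random_state is fixed only to make repeated runs reproducible
classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(features, labels)

# Illustrative query: a bumpy fruit whose first feature is 160
prediction = classifier.predict([[160, texture_codes["bumpy"]]])
predicted_name = label_names[int(prediction[0])]
print(predicted_name)
```

Keeping the encoding in dictionaries means the same mapping is used for training data, for new queries, and for decoding the result.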


## Iris Flower Dataset
Let's apply what we've learned from the simple example above to a real dataset.

In [106]:
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

# Scikit-Learn ships with the Iris dataset, so we don't need to read it from a CSV file
iris = load_iris()
# Hold out the first sample of each of the three classes for testing
test_idx = [0, 50, 100]

# Training data
training_target = np.delete(iris.target, test_idx)
training_data = np.delete(iris.data, test_idx, axis = 0)

# Testing data
testing_target = iris.target[test_idx]
testing_data = iris.data[test_idx]

classifier = tree.DecisionTreeClassifier()
classifier.fit(training_data, training_target)

print(classifier.predict(testing_data[:1]))
print(testing_data[0], testing_target[0])
print(iris.feature_names, iris.target_names)

[0]
[ 5.1  3.5  1.4  0.2] 0
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] ['setosa' 'versicolor' 'virginica']
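
The cell above checks only the first held-out sample. As a minimal sketch, the same split can be used to predict all three test samples, print the species names instead of the numeric codes, and compute an accuracy score. `accuracy_score` comes from `sklearn.metrics`; fixing `random_state` is an addition made here for reproducibility and is not in the original cell.

```python
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
test_idx = [0, 50, 100]

# Same split as above: three samples held out, the rest used for training
training_data = np.delete(iris.data, test_idx, axis=0)
training_target = np.delete(iris.target, test_idx)
testing_data = iris.data[test_idx]
testing_target = iris.target[test_idx]

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(training_data, training_target)

# Predict every held-out sample and show names rather than codes
predictions = classifier.predict(testing_data)
for predicted, actual in zip(predictions, testing_target):
    print(iris.target_names[predicted], "expected", iris.target_names[actual])

# Fraction of held-out samples that were labeled correctly
accuracy = accuracy_score(testing_target, predictions)
print("accuracy:", accuracy)
```

With only three test samples the accuracy is a very coarse measure; a larger held-out set gives a more meaningful number.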


## Password Strength Machine Learning
A dataset of passwords is provided. Properties extracted from each password can be used to determine its strength.

### Rules for Password Strength
These are the rules that we will use for password strength. Passwords have a strength between 0 and 4.
Let's assign points to passwords. Because the maximum strength is 4, presumably only the highest-scoring character rule that a password satisfies counts toward its total.
* A password with a character length of 8 or more gets 1 point.
* A password containing a special character, an uppercase alphabet character, and a numeric character gets an additional 3 points.
* A password containing an uppercase alphabet character and a numeric character gets an additional 1 point.
* A password containing a special character and a numeric character gets an additional 2 points.
* A password containing a special character and an uppercase alphabet character gets an additional 2 points.
* A password that does not meet any of these requirements gets 0 points.
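
The rules above can be sketched as a small scoring function. Two assumptions are made here that the rules do not spell out: a "special character" is taken to be any ASCII punctuation mark (`string.punctuation`), and only the highest-scoring character rule is applied, so the total never exceeds 4. The function name and the example passwords are made up for illustration.

```python
import string


def password_strength(password):
    """Score a password from 0 to 4 using the rules listed above."""
    has_upper = any(ch.isupper() for ch in password)
    has_digit = any(ch.isdigit() for ch in password)
    # Assumption: a special character is any ASCII punctuation mark
    has_special = any(ch in string.punctuation for ch in password)

    # Length rule: 8 or more characters earns 1 point
    points = 1 if len(password) >= 8 else 0

    # Assumption: only the highest-scoring character rule counts
    if has_special and has_upper and has_digit:
        points += 3
    elif has_special and (has_digit or has_upper):
        points += 2
    elif has_upper and has_digit:
        points += 1
    return points


# Made-up passwords, one for each possible score
examples = ["abc", "password", "Passw0rd", "pass#word1", "P@ssw0rd"]
scores = {p: password_strength(p) for p in examples}
print(scores)
```

A function like this could be used to check the labels in the dataset or the predictions of a trained classifier.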

### The Dataset
The following datasets are provided:
* **Training:** A dataset for training your machine learning algorithm.
* **Testing:** A dataset for testing your machine learning algorithm.

### Preparing the Data
The passwords as they exist have no meaning to a machine learning algorithm.
You will need to extract properties of each password that the algorithm can use.
Example: the number of special characters.

Use the dataset: datasets/Passwords.csv
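
As a starting point, feature extraction might look like the sketch below, which turns each password into a row of counts that a Scikit-Learn classifier can accept. The choice of features, the function name, and the sample passwords are assumptions made for illustration. As a further assumption, "special character" is taken to mean ASCII punctuation. The layout of `datasets/Passwords.csv` is not described here, so the sketch works on an in-memory list instead of reading the file.

```python
import string


def extract_features(password):
    """Turn a password into a list of numeric features."""
    return [
        len(password),                                     # total length
        sum(ch.isupper() for ch in password),              # uppercase letters
        sum(ch.isdigit() for ch in password),              # numeric characters
        sum(ch in string.punctuation for ch in password),  # special characters
    ]


# Made-up passwords standing in for rows of the real dataset
sample_passwords = ["abc", "Passw0rd", "P@ssw0rd!"]
feature_rows = [extract_features(p) for p in sample_passwords]

for password, row in zip(sample_passwords, feature_rows):
    print(password, row)
```

Each row has the same length and contains only numbers, which is the shape `fit` and `predict` expect.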

In [96]:
import csv
import numpy as np
from sklearn import tree

# Read data to lists and identify features

# Plot and understand your features

# Make a Scikit-Learn classifier

# Fit the model using your classifier

# Determine the level of correctness (accuracy) on the testing data
