In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### read the raw data

In [2]:
data = pd.read_csv('/Users/Aaron_hill/repos/visualizedata/ml/week01/metadata/definitions.csv')

# create a boolean indicator: True if the term being defined is "machine learning"
data['ml'] = data['term'] == 'machine learning'

print(data.dtypes)
print(data.head(10))

term          object
definition    object
ml              bool
dtype: object
               term                                         definition    ml
0  machine learning  Machine learning is the development of functio...  True
1  machine learning  it is training an algorithm on how to make pre...  True
2  machine learning  Computers making decisions and learning throug...  True
3  machine learning  Creating a model to teach an algorithm to reco...  True
4  machine learning  Applying a model to a large amount of data to ...  True
5  machine learning  A subset/practice of Artificial Intelligence w...  True
6  machine learning   let machine read a lot of data, and conclude ...  True
7  machine learning  machine process of aggregating information in ...  True
8  machine learning  Automatic or semi-automatic updating of equati...  True
9  machine learning  Machine Learning is a method of computation th...  True


### <span style="color:red">create feature set X (matrix) and vector of labels L</span>

Use [feature extraction methods in scikit-learn](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) to *vectorize* the text of machine learning/AI definitions into an X matrix. 

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

L = data["ml"] # labels
corpus = data['definition'] # corpus of definitions (raw text)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)

(107, 494)
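To make the shape above concrete: each row of X is one definition and each column is one word of the vocabulary. The following is a minimal pure-Python sketch of the same bag-of-words idea on a made-up two-sentence corpus. It only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, alphabetical column order), and the names `tokenize`, `to_counts`, and `toy_corpus` are illustrative, not part of scikit-learn.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenization:
# lowercase the text, keep runs of 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Hypothetical toy corpus (not taken from definitions.csv).
toy_corpus = [
    "Machine learning is learning from data",
    "Intelligence demonstrated by machines",
]

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index.
all_tokens = sorted({tok for doc in toy_corpus for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_counts(text):
    """Return one row of the count matrix: occurrences of each vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in all_tokens]

toy_X = [to_counts(doc) for doc in toy_corpus]

print(vocabulary)
for row in toy_X:
    print(row)
```

In this toy matrix the word "learning" appears twice in the first sentence, so its column holds a 2 in the first row and a 0 in the second.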


### let's look at the words in the corpus of "definition"

In [4]:
print(len(vectorizer.vocabulary_))
print("* * * * *")
print(vectorizer.vocabulary_)

494
* * * * *
{'machine': 241, 'learning': 228, 'is': 210, 'the': 418, 'development': 114, 'of': 284, 'functions': 158, 'made': 243, 'on': 285, 'patterns': 305, 'found': 154, 'in': 193, 'data': 101, 'it': 211, 'training': 442, 'an': 24, 'algorithm': 17, 'how': 185, 'to': 437, 'make': 245, 'predictions': 315, 'computers': 85, 'making': 247, 'decisions': 105, 'and': 26, 'through': 433, 'algorithms': 18, 'trained': 441, 'human': 186, 'inputs': 203, 'creating': 92, 'model': 263, 'teach': 409, 'recognize': 339, 'applying': 29, 'large': 225, 'amount': 23, 'meaningful': 254, 'pattern': 304, 'output': 297, 'from': 156, 'subset': 393, 'practice': 313, 'artificial': 33, 'intelligence': 205, 'which': 478, 'improves': 192, 'based': 45, 'ingesting': 201, 'let': 231, 'read': 335, 'lot': 240, 'conclude': 86, 'program': 325, 'by': 61, 'itself': 213, 'then': 420, 'when': 474, 'encounter': 128, 'with': 486, 'new': 277, 'can': 62, 'automatically': 41, 'deal': 103, 'some': 381, 'are': 30, 'supervised': 39

### <span style="color:red">fit a linear SVM to X, L using stochastic gradient descent</span>

[SGDClassifier (stochastic gradient descent) documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

In [5]:
# fit SVM linear classifier
from sklearn import linear_model
sgd = linear_model.SGDClassifier()
sgd.fit(X, L)



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)
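The printed parameters show `loss='hinge'` and `penalty='l2'`, which is the linear SVM objective. As a rough illustration of what one pass of gradient descent on that objective does, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: it assumes a fixed learning rate instead of the `'optimal'` schedule, does no shuffling, and runs on a made-up four-row dataset. The names `fit_hinge_sgd`, `decision`, `toy_X`, and `toy_y` are hypothetical.

```python
# Toy data: columns are hypothetical counts of two words; labels are +1 / -1.
toy_X = [[2, 0], [1, 0], [0, 1], [0, 2]]
toy_y = [1, 1, -1, -1]

def decision(w, b, x):
    """Signed score w.x + b of the linear model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fit_hinge_sgd(X, y, alpha=0.0001, lr=0.1, epochs=20):
    """Minimize hinge loss + (alpha / 2) * ||w||^2 one sample at a time."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            # The hinge loss is active when the sample is misclassified
            # or lies inside the margin (target * score < 1).
            violated = target * decision(w, b, x) < 1
            for j in range(len(w)):
                grad = alpha * w[j] - (target * x[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * target
    return w, b

w, b = fit_hinge_sgd(toy_X, toy_y)
predictions = [1 if decision(w, b, x) > 0 else -1 for x in toy_X]
print(w, b)
print(predictions)
```

Samples that are already classified with a margin of at least 1 contribute only the small L2 shrinkage term, which is why the weights stop growing once the toy data is separated.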

### assess performance

In [6]:
# look at performance measures
import my_measures

performance_measures = my_measures.BinaryClassificationPerformance(sgd.predict(X), L, 'sgd')
performance_measures.compute_measures()
print(performance_measures.performance_measures)

{'Pos': 35, 'Neg': 72, 'TP': 34, 'TN': 71, 'FP': 1, 'FN': 1, 'Accuracy': 0.9813084112149533, 'Precision': 0.9714285714285714, 'Recall': 0.9714285714285714, 'desc': 'sgd'}


![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png)

[source](https://en.wikipedia.org/wiki/Precision_and_recall)
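The `my_measures` module is local to this notebook and its source is not shown here, but the reported values agree with the standard definitions of accuracy, precision, and recall. A minimal sketch that recomputes them from the confusion-matrix counts printed above:

```python
# Confusion-matrix counts copied from the printed performance measures.
tp, tn, fp, fn = 34, 71, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are truly positive
recall = tp / (tp + fn)                     # share of true positives that were found

print(accuracy, precision, recall)
```

Note that these counts come from `sgd.predict(X)` on the same X used for fitting, so they describe performance on the training data rather than on unseen definitions.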

### test model on unseen definitions

In [8]:
# machine learning definitions
# sources: Wikipedia, Expert System, Tech Emergence
ml_defs = ["Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.",
          "Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it learn for themselves.",
          "Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions."]

for d in ml_defs:
    print(d)
    print("* * *")

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.
* * *
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it learn for themselves.
* * *
Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.
* * *


In [9]:
# AI definitions
# sources: Wikipedia, Oxford dictionary
ai_defs = ["Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals.",
          "the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages"]

for d in ai_defs:
    print(d)
    print("* * *")

Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals.
* * *
the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages
* * *


In [10]:
# Definitions of unrelated things: kitten, piano, widget
other_defs = ["A kitten, also known as a kitty or kitty cat, is a juvenile cat.",
             "The piano is an acoustic, stringed musical instrument invented in Italy by Bartolomeo Cristofori around the year 1700 in which the strings are struck by hammers.",
             "a small gadget or mechanical device, especially one whose name is unknown or unspecified"]

for d in other_defs:
    print(d)
    print("* * *")

A kitten, also known as a kitty or kitty cat, is a juvenile cat.
* * *
The piano is an acoustic, stringed musical instrument invented in Italy by Bartolomeo Cristofori around the year 1700 in which the strings are struck by hammers.
* * *
a small gadget or mechanical device, especially one whose name is unknown or unspecified
* * *


### function to transform a new definition to an X vector and predict its label

In [11]:
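def get_prediction(definition):
    """Vectorize one definition with the fitted vocabulary and return the predicted label."""
    # transform (not fit_transform) reuses the training vocabulary
    text_x = vectorizer.transform([definition])
    return sgd.predict(text_x)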
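One detail worth keeping in mind when reading the predictions: `transform` maps new text onto the vocabulary learned during `fit`, so words the vectorizer never saw are silently dropped. The sketch below shows this on a small made-up corpus (the names `toy_corpus` and `toy_vectorizer` are illustrative and separate from the notebook's `vectorizer`).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from definitions.csv).
toy_corpus = [
    "machines learn patterns from data",
    "intelligence shown by machines",
]
toy_vectorizer = CountVectorizer().fit(toy_corpus)

# Every word here is in the fitted vocabulary, so all four are counted.
seen = toy_vectorizer.transform(["machines learn from data"])

# None of these words is in the vocabulary ("a" is also too short for the
# default tokenizer), so the resulting row is all zeros.
unseen = toy_vectorizer.transform(["a juvenile cat"])

print(seen.shape, int(seen.sum()))
print(unseen.shape, int(unseen.sum()))
```

A definition that shares few words with the training corpus therefore reaches the classifier as a nearly empty vector, and the prediction rests mostly on the model's intercept and on common words such as "is" or "the".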

### view predicted classifications of new definitions

In [12]:
print("Model predictions for 'machine learning' definitions:")
for mld in ml_defs:
    print(get_prediction(mld))
    
print("* * *")
print("Model predictions for 'AI' definitions:")
for aid in ai_defs:
    print(get_prediction(aid))

print("* * *")
print("Model predictions for other definitions (kitten, piano, widget):")
for otherd in other_defs:
    print(get_prediction(otherd))

Model predictions for 'machine learning' definitions:
[ True]
[False]
[ True]
* * *
Model predictions for 'AI' definitions:
[False]
[False]
* * *
Model predictions for other definitions (kitten, piano, widget):
[False]
[False]
[False]
