# Scikit-Learn



*   It is a toolkit that offers a much wider range of models than NLTK. NLTK itself implements only a handful of classifiers, such as MaxentClassifier, DecisionTreeClassifier and NaiveBayesClassifier, whereas scikit-learn provides many more models to experiment with.

*   It has parallelization capabilities, is built on top of NumPy, and provides an API for tasks such as data pre-processing, vectorization and model training.




```
# To install it on your local machine, use
pip3 install scikit-learn
```






In [6]:
# Import the required libraries and a toy dataset for the demonstration
import nltk
nltk.download('names')
from nltk.corpus import names
from sklearn.model_selection import train_test_split

# Extract features from a name: here, only its last letter
def feature_extract(name):
    return {
        'last_letter': name[-1]
    }

# get names
boy_names = names.words('male.txt')
girl_names = names.words('female.txt')
 
# Build the dataset
boy_names_dataset = [(feature_extract(name), 'boy') for name in boy_names]
girl_names_dataset = [(feature_extract(name), 'girl') for name in girl_names]
# Put all the names together
data = boy_names_dataset + girl_names_dataset
# Split the data into features (X) and class labels (y)
X, y = list(zip(*data))
# Shuffle the data and split it into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, shuffle=True)
print(X_train)
print(y_train)

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
[{'last_letter': 'e'}, {'last_letter': 'd'}, {'last_letter': 'e'}, {'last_letter': 'e'}, {'last_letter': 'h'}, {'last_letter': 'e'}, {'last_letter': 'o'}, {'last_letter': 'a'}, {'last_letter': 'a'}, {'last_letter': 'a'}, {'last_letter': 'e'}, {'last_letter': 'a'}, {'last_letter': 'i'}, {'last_letter': 'n'}, {'last_letter': 'a'}, {'last_letter': 'e'}, {'last_letter': 'y'}, {'last_letter': 'n'}, {'last_letter': 'a'}, {'last_letter': 'y'}, {'last_letter': 'e'}, {'last_letter': 'n'}, {'last_letter': 's'}, {'last_letter': 'a'}, {'last_letter': 'n'}, {'last_letter': 'e'}, {'last_letter': 'h'}, {'last_letter': 'e'}, {'last_letter': 'e'}, {'last_letter': 'a'}, {'last_letter': 'e'}, {'last_letter': 'e'}, {'last_letter': 's'}, {'last_letter': 'n'}, {'last_letter': 'a'}, {'last_letter': 'r'}, {'last_letter': 'y'}, {'last_letter': 'p'}, {'last_letter': 'e'}, {'last_letter': 'e'}, {'last_l

### Implementing the same using Scikit-learn API

*   There are two main components in scikit-learn:
    1. Transformers: objects that convert data from one format to another (`fit`, `transform`).
    2. Predictors: models that predict the class of a data point (`fit`, `predict`).
*   Other important methods include `fit`, used for training, and `score`, used for evaluation.
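The two roles can be sketched on a tiny hand-written dataset. The data and the `toy_*` names below are hypothetical and chosen only for illustration; they are not part of the NLTK names corpus used in this notebook.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (not the NLTK names corpus)
toy_features = [{'last_letter': 'a'}, {'last_letter': 'a'},
                {'last_letter': 'n'}, {'last_letter': 'k'}]
toy_labels = ['girl', 'girl', 'boy', 'boy']

# Transformer: learns the dict layout in fit(), converts data in transform()
toy_vectorizer = DictVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_features)

# Predictor: learns from the vectorized data in fit(), classifies in predict()
toy_classifier = DecisionTreeClassifier(random_state=0)
toy_classifier.fit(toy_matrix, toy_labels)

toy_prediction = toy_classifier.predict(
    toy_vectorizer.transform([{'last_letter': 'a'}]))
print(toy_prediction)
```

`fit_transform` is a shortcut for calling `fit` and then `transform` on the same data.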


In [0]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
dict_vectorizer = DictVectorizer()
name_classifier = DecisionTreeClassifier()
# Scikit-learn models work with arrays, not dicts.
# We need to fit the vectorizer first so that
# it learns the format of the dicts.
dict_vectorizer.fit(X_train)
# Vectorize the training data
X_train_vectorized = dict_vectorizer.transform(X_train)
# Train the classifier on the vectorized data
name_classifier.fit(X_train_vectorized, y_train)
# Vectorize the test data so the model can be evaluated on it
X_test_vectorized = dict_vectorizer.transform(X_test)
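Once the test data is vectorized, a natural next step is to measure accuracy with `score`. The sketch below shows the idea on a small hand-written set of names so that it runs on its own; the names, labels and `toy_*` variables are assumptions made for illustration, and the accuracy on the real names corpus will differ.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def last_letter_features(name):
    # Same idea as feature_extract above: use only the last letter
    return {'last_letter': name[-1]}

# Hypothetical hand-written data, split manually into train and test
toy_train = [('Anna', 'girl'), ('Maria', 'girl'), ('Julie', 'girl'),
             ('John', 'boy'), ('Mark', 'boy'), ('Peter', 'boy')]
toy_test = [('Laura', 'girl'), ('Ethan', 'boy')]

toy_X_train = [last_letter_features(name) for name, _ in toy_train]
toy_y_train = [label for _, label in toy_train]
toy_X_test = [last_letter_features(name) for name, _ in toy_test]
toy_y_test = [label for _, label in toy_test]

toy_vectorizer = DictVectorizer()
toy_classifier = DecisionTreeClassifier(random_state=0)
toy_classifier.fit(toy_vectorizer.fit_transform(toy_X_train), toy_y_train)

# score() returns the mean accuracy on the given data and labels
toy_accuracy = toy_classifier.score(
    toy_vectorizer.transform(toy_X_test), toy_y_test)
print(toy_accuracy)
```

In the notebook itself, the equivalent call would be `name_classifier.score(X_test_vectorized, y_test)`.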

In [0]:
NAMES = ['Lara', 'Carla', 'Ioana', 'George', 'Steve', 'Stephan']
transformed = dict_vectorizer.transform([feature_extract(name) for name in NAMES])

In [10]:
# The names above are now represented as a sparse matrix
print(transformed)

  (0, 1)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 5)	1.0
  (4, 5)	1.0
  (5, 14)	1.0


In [11]:
# to check the data type
print(type(transformed))

<class 'scipy.sparse.csr.csr_matrix'>
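A sparse matrix stores only its non-zero entries, which is why the printout above lists `(row, column)` pairs. To inspect it as a regular array, it can be converted with `toarray()`. The following is a minimal sketch on three hypothetical letters; the `demo_*` names are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on three hypothetical last letters so there are three columns
demo_vectorizer = DictVectorizer()
demo_vectorizer.fit([{'last_letter': 'a'},
                     {'last_letter': 'e'},
                     {'last_letter': 'n'}])

# transform() returns a SciPy sparse matrix by default
demo_sparse = demo_vectorizer.transform([{'last_letter': 'a'},
                                         {'last_letter': 'n'}])

# toarray() converts it to a dense NumPy array, one column per feature
demo_dense = demo_sparse.toarray()
print(demo_vectorizer.feature_names_)
print(demo_dense)
```

Each row has a single `1.0` in the column of its last letter, which is a one-hot encoding of the feature.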


In [12]:
# if we want, we can perform an inverse operation
print(dict_vectorizer.inverse_transform(transformed))

[{'last_letter=a': array(1.)}, {'last_letter=a': array(1.)}, {'last_letter=a': array(1.)}, {'last_letter=e': array(1.)}, {'last_letter=e': array(1.)}, {'last_letter=n': array(1.)}]


In [13]:
# We can check the names of the features
print(dict_vectorizer.feature_names_)


['last_letter= ', 'last_letter=a', 'last_letter=b', 'last_letter=c', 'last_letter=d', 'last_letter=e', 'last_letter=f', 'last_letter=g', 'last_letter=h', 'last_letter=i', 'last_letter=j', 'last_letter=k', 'last_letter=l', 'last_letter=m', 'last_letter=n', 'last_letter=o', 'last_letter=p', 'last_letter=r', 'last_letter=s', 'last_letter=t', 'last_letter=u', 'last_letter=v', 'last_letter=w', 'last_letter=x', 'last_letter=y', 'last_letter=z']


In [15]:
# Now let's try to make some predictions
NAMES = ['Lara', 'Carla', 'Ioana', 'George', 'Steve', 'Stephan']
transformed = dict_vectorizer.transform([feature_extract(name) for name in NAMES])
print(name_classifier.predict(transformed))

['girl' 'girl' 'girl' 'girl' 'girl' 'boy']
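'George' and 'Steve' are labelled 'girl' above, most likely because the model sees only the last letter and names ending in 'e' are presumably more often female in the training data. One way to look at this kind of behaviour is `predict_proba`, which reports the class proportions the tree learned. The sketch below uses a hypothetical toy dataset with made-up counts, not the real corpus.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical counts: names ending in 'e' are mostly, but not only, 'girl'
toy_features = [{'last_letter': 'e'}] * 4 + [{'last_letter': 'n'}] * 2
toy_labels = ['girl', 'girl', 'girl', 'boy', 'boy', 'boy']

toy_vectorizer = DictVectorizer()
toy_classifier = DecisionTreeClassifier(random_state=0)
toy_classifier.fit(toy_vectorizer.fit_transform(toy_features), toy_labels)

# Columns of predict_proba follow the order of classes_
toy_proba = toy_classifier.predict_proba(
    toy_vectorizer.transform([{'last_letter': 'e'}]))
print(toy_classifier.classes_)
print(toy_proba)
```

With a single feature, every name ending in the same letter receives the same prediction, so adding more features (for example the first letter or the name length) is one possible way to let the model tell such names apart.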
