Phases of data modeling - Test, train, deploy
Classification model - where variable predicted (target) is categorical
ad/nonad = binary classification
classification model = "classifier"
Data involved in classification models:
inputs = features used in form of dataframe/matrix
labels - column in dataframe
entities after running classifier = predictive classes, and corresponding confidence
parameters

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
import requests
from bs4 import BeautifulSoup
from sklearn.preprocessing import *

In [None]:
%matplotlib inline

### Access data ###

The first step was to access the Internet Advertisements Data Set from the UCL ML repo: https://archive.ics.uci.edu/ml/datasets/Internet+Advertisements

The internet advertisement data folder contains two files - one which contains the data set and one which contains header names. These two  need to be combined - providing the foundation for the features used predict whether an image is an advertisement ("ad") or not ("nonad")

First, we take the ad data set from the url and and load it into a pandas dataframe

In [None]:
#assign url as string where we will pull data from
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.data"

#pull down the csv file into a pandas dataframe
ad_df = pd.read_csv(url, header=None, dtype=None)

To get a quick view of the data, we'll use .head() function to preview the first 5 rows of data

In [None]:
ad_df.head()

Next, we need to get meaningful column headers. Since we set 'None' to the header arguement (header='None') the columns have default integer column names.

This time, after we assign the url to a variable, we're going to use the request.get() function to pull non CSV data??? and store the url content as a requests object. As a requests object we can then pull out information from the scraped html using BeatifulSoup package.

In [None]:
#assign url where we will pull data from for column names
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.names"

#use requests to get url and store information as requests object
response = requests.get(url)

# use BeautifulSoup on content object to pull out relevant content
#lxml = type of html format we are working with
#.text = Get all the child strings and return concatenated using the given separator
#.split('\n') = given separator
soup = BeautifulSoup(response.content, "lxml").text.split('\n')

In [None]:
#print(soup)

To easily view and clean the scraped data we'll create a single column dataframe with a name attribute 

In [None]:
ad_names = pd.DataFrame({'header':soup})
print (ad_names)

### Data Prep

As we can see above, the header values have a number of issues we need to clean up, including: 1) feature terms - we know from the data set description we have 5 groups of feature terms - url, origurl, ancurl, alt, and caption and each of these is mentioned in the data set and need to be removed as they won't be used for headers. 2) empty rows 3) the binary the string contains a "0" and "1" 4) various characters - many of the attributes contain characters which we need to remove in order to have a clean header set ex: "*" "," "."

To achieve cleaning up the values in the header set, we'll create a function called "clean_header" which will take each row as an input and apply a series of strings methods. 

First, we'll want to remove all characters after ":" to separate the header from the potential values listed, 1 or 0. # https://stackoverflow.com/questions/1178335/in-python-2-4-how-can-i-strip-out-characters-after

Second, by looking at the column of data we see feature terms are indicated by a "|". We'll use this to identify those values we want to remove.

Third, we'll replace characters "*" and "+" with "_" to make the data easier to read. 

In [None]:
#clean up the values in the column
def clean_header(row):
     return str(row.split(":")[0].split("|")[0].replace("*","_").replace("+","_").strip())      

    #save cleaned ad_names
ad_names['header'] = ad_names['header'].apply(clean_header)

In [None]:
ad_names.head()

To determine how to best handle the empty cells we check if they contain null values to drop or empty strings to inform next steps 

In [None]:
#check to see if empty strings?
ad_names['header'].isnull().sum()

Now that we know these aren't null values, we'll convert the strings to np.nan objects using replace(), and then call dropna()on the DataFrame to delete rows with null tenants.

In [None]:
ad_names['header'].replace('', np.nan, inplace=True)

#
ad_names.dropna(subset=['header'], inplace=True)

#reset index
ad_names = ad_names.reset_index(drop=True)

In [None]:
ad_names.head()

We know from the dataset we pulled into the dataframe above that the predictor (ad, nonad) is the last column of data; however, it appears as the first value in this header list. We'll move this to end of the header column.

In [None]:
#drop first row
ad_names = ad_names.drop([0])

#reset index
ad_names = ad_names.reset_index(drop=True)

#assign value
ad_names.set_value(1558,'header','ad, nonad')

Now that we have a complete list of column headers, we need to apply them to the data set.

We'll turn our single column dataframe into a list then assign the list as column names, replacing the default integers (0-1558).

In [None]:
#create a column name list
ad_columns = ad_names["header"].tolist()

#assign column names to dataframe
ad_df.columns = ad_columns

Now that we've combined the data set with the column headers into a table, we'll begin exploring and preparing the data to start the process of identifying and replacing outliers, dealing with null and missing values,

## Exploratory Data Analysis

At first glance, the dataset looks clean

In [None]:
ad_df.head()

All of the columns look numeric; however, we need to determine they are before we can start using them to...

By using dtypes.value_counts() we can quickly determine what datatypes we have and how many. With 1558 attributes, this will help quickly identify potential issues.

In [None]:
#count data types
ad_df.dtypes.value_counts()

From the data set description we know there are 1558 attributes - three continous with all others are binary. 

To prepare the data set we're going to first determine the data types, determine what the appropriate data type should be, and coerce them to numeric, if necessary.

We expected 1 of the 1559 attributes to be an object as this column tells us which instance is an ad or nonad. The three continuous attributes ('height', 'width', 'aratio') are expected to be numeric; however, are most likely listed as 'object' due to the missing values discussed in the data set description.

We also know that one or more of the three continous features are missing in 28% of the instances.

To address these issues, we will need to confirm which attributes have the 'object' datatype.

In [None]:
ad_df.select_dtypes(include='object').columns

In addition to 'height', 'width', 'aratio', and 'ad, nonad', the "local" column appears to also be an object datatype. 

We can use .unique() to determine the values in this column to determine why it is appearing as an object.

In [None]:
print(ad_df.loc[:,"local"].unique())

The placeholder '?' is causing this column to appear as an 'object' datatype. 

For this and the three attributes which are supposed to be continuous we're going to replace the missing values and ensure they are numeric. 

First, we'll create the non-numeric values using the pd.to_numeric function, then replace those nan values with imputed values from the dataset

In [None]:
# Coerce data type to numeric
# errors='coerce' tells function to turn string into nan if it cannot be turned into a number
ad_df.loc[:, "height"] = pd.to_numeric(ad_df.loc[:, "height"], errors='coerce')
ad_df.loc[:, "width"] = pd.to_numeric(ad_df.loc[:, "width"], errors='coerce')
ad_df.loc[:, "aratio"] = pd.to_numeric(ad_df.loc[:, "aratio"], errors='coerce')
ad_df.loc[:, "local"] = pd.to_numeric(ad_df.loc[:, "local"], errors='coerce')

#Determine the location of nans
HasNanH = np.isnan(ad_df.loc[:,"height"])
HasNanW = np.isnan(ad_df.loc[:,"width"])
HasNanA = np.isnan(ad_df.loc[:,"aratio"])
HasNanL = np.isnan(ad_df.loc[:,"local"])

#impute median from valumn and apply to nan vlaues
ad_df.loc[HasNanH, "height"] = np.nanmedian(ad_df.loc[:,"height"])
ad_df.loc[HasNanW, "width"] = np.nanmedian(ad_df.loc[:,"width"])
ad_df.loc[HasNanA, "aratio"] = np.nanmedian(ad_df.loc[:,"aratio"])
ad_df.loc[HasNanL, "local"] = np.nanmedian(ad_df.loc[:,"local"])

In [None]:
#check data type counts
ad_df.dtypes.value_counts()

Next, we'll visualize the data using a histogram to understand the overview of the numeric distribution.

In [None]:
plt.hist(ad_df.loc[:, "height"])

In [None]:
plt.hist(ad_df.loc[:, "width"])

In [None]:
plt.hist(ad_df.loc[:, "aratio"])

These visuals indicate that we have outliers at the high end of distributions. We'll 

In [None]:
## The high limit for acceptable values is the mean plus 2 standard deviations
LimitHi = ad_df.loc[:, "height"].mean() + 2*(ad_df.loc[:, "height"].std())
LimitHi = ad_df.loc[:, "aratio"].mean() + 2*(ad_df.loc[:, "aratio"].std())

#Replace outliers
TooHigh = ad_df.loc[:, "height"] > LimitHi
TooHigh = ad_df.loc[:, "aratio"] > LimitHi

ad_df.loc[TooHigh, "height"] = LimitHi
ad_df.loc[TooHigh, "aratio"] = LimitHi

In [None]:
plt.hist(ad_df.loc[:, "height"])

In [None]:
plt.hist(ad_df.loc[:, "aratio"])

In [None]:
ad_df.head()

In [None]:
#remove . from "ad nonad" column
ad_df["ad nonad"] = ad_df["ad nonad"].map(lambda x: str(x)[:-1])

In [None]:
# plot the counts for each category
ad_df.loc[:,"ad nonad"].value_counts().plot(kind='bar')

In [None]:
#create new numeric colmns
ad_df.loc[:,"ad"] = (ad_df.loc[:,"ad nonad"] == "ad").astype(int)
ad_df.loc[:,"nonad"] = (ad_df.loc[:,"ad nonad"] == "nonad").astype(int)

In [None]:
# Remove obsolete column "ad nonad"
ad_df = ad_df.drop("ad nonad", axis=1)

### Normalization

In [None]:
#columns to apply z-normalization aka standardization
p = ad_df[['height','width','aratio']]

In [None]:
#standardization - change the variable so that it’s mean is equal to 0.0 and its standard dev is equal to 1.0
standardization_scale = StandardScaler().fit(p)

In [None]:
z = standardization_scale.transform(p)

In [None]:
hc_scaled = pd.DataFrame(z)

In [None]:
ad_df[['height','width','aratio']] = hc_scaled

In [None]:
#Drop 'nonad' column. 'ad' will be the target
ad_df = ad_df.drop("nonad", axis=1)

In [228]:
ad_df.head()

Unnamed: 0,height continuous,width continuous,aratio continuous,local,url_images+buttons,url_likesbookscom,url_wwwslakecom,url_hydrogeologist,url_oso,url_media,...,caption_home,caption_my,caption_your,caption_in,caption_bytes,caption_here,caption_click,caption_for,caption_you,ad
0,1.817265,-0.158911,-0.804119,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,-0.025925,2.888693,1.96735,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,-0.676463,0.774029,1.49039,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0.055392,2.888693,1.809568,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0.055392,2.888693,1.809568,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### Export data

In [229]:
#dataframe to csv
#ad_df.to_csv('InternetAd_Dataset.csv', index=None)

In [2]:
from sklearn.model_selection import train_test_split

In [5]:
#data = pd.read_csv('InternetAd_Dataset.csv')

In [6]:
y = data.ad
X = data.drop('ad', axis=1)

NameError: name 'data' is not defined

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
print "\nX_train:\n"
print(X_train.head())
print X_train.shape
print "\nX_test:\n"
print(X_test.head())
print X_test.shape

### Data Modeling

In [230]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 

In [231]:
def split_dataset(data, r): # split a dataset
	N = len(data)
	X = []
	Y = []
	
	if r >= 1: 
		print ("Parameter r needs to be smaller than 1!")
		return
	elif r <= 0:
		print ("Parameter r needs to be larger than 0!")
		return

	n = int(round(N*r)) # number of elements in testing sample
	nt = N - n # number of elements in training sample
	ind = -np.ones(n,int) # indexes for testing sample
	R = np.random.randint(N) # some random index from the whole dataset
	
	for i in range(n):
		while R in ind: R = np.random.randint(N) # ensure that the random index hasn't been used before
		ind[i] = R

	ind_ = list(set(range(N)).difference(ind)) # remaining indexes	
	X = data[ind_,:-1] # training features
	XX = data[ind,:-1] # testing features
	Y = data[ind_,-1] # training targets
	YY = data[ind,-1] # testing targests
	return X, XX, Y, YY

In [232]:
r = 0.2 # ratio of test data over all data (this can be changed to any number between 0.0 and 1.0 (not inclusive)
dataset = np.genfromtxt('InternetAd_Dataset.csv', delimiter=",", skip_header=1)
X, XX, Y, YY = split_dataset(dataset, r)

In [233]:
""" CLASSIFICATION MODELS """
# Logistic regression classifier
print ('\n\n\nLogistic regression classifier\n')
C_parameter = 50. / len(X) # parameter for regularization of the model
class_parameter = 'ovr' # parameter for dealing with multiple classes
penalty_parameter = 'l1' # parameter for the optimizer (solver) in the function
solver_parameter = 'saga' # optimization system used
tolerance_parameter = 0.1 # termination parameter




Logistic regression classifier



In [234]:
#Training the Model
clf = LogisticRegression(C=C_parameter, multi_class=class_parameter, penalty=penalty_parameter, solver=solver_parameter, tol=tolerance_parameter)
clf.fit(X, Y) 
print ('coefficients:')
print (clf.coef_) # each row of this matrix corresponds to each one of the classes of the dataset
print ('intercept:')
print (clf.intercept_) # each element of this vector corresponds to each one of the classes of the dataset

# Apply the Model
print ('predictions for test set:')
print (clf.predict(XX))
print ('actual class values:')
print (YY)

coefficients:
[[0.         0.80858795 0.00123179 ... 0.         0.         0.        ]]
intercept:
[-2.2999795]
predictions for test set:
[1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0

In [235]:
# Naive Bayes classifier
print ('\n\nNaive Bayes classifier\n')
nbc = GaussianNB() # default parameters are fine
nbc.fit(X, Y)
print ("predictions for test set:")
print (nbc.predict(XX))
print ('actual class values:')
print (YY)



Naive Bayes classifier

predictions for test set:
[0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1.
 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0.
 1. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0.
 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1.
 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.
 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0.
 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.
 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0.

In [236]:
# k Nearest Neighbors classifier
print ('\n\nK nearest neighbors classifier\n')
k = 5 # number of neighbors
distance_metric = 'euclidean'
knn = KNeighborsClassifier(n_neighbors=k, metric=distance_metric)
knn.fit(X, Y)
print ("predictions for test set:")
print (knn.predict(XX))
print ('actual class values:')
print (YY)



K nearest neighbors classifier

predictions for test set:
[1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.
 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

In [237]:
# Support vector machine classifier
t = 0.001 # tolerance parameter
kp = 'rbf' # kernel parameter
print ('\n\nSupport Vector Machine classifier\n')
clf = SVC(kernel=kp, tol=t)
clf.fit(X, Y)
print ("predictions for test set:")
print (clf.predict(XX))
print ('actual class values:')
print (YY)
####################



Support Vector Machine classifier

predictions for test set:
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

In [238]:
# Decision Tree classifier
print ('\n\nDecision Tree classifier\n')
clf = DecisionTreeClassifier() # default parameters are fine
clf.fit(X, Y)
print ("predictions for test set:")
print (clf.predict(XX))
print ('actual class values:')
print (YY)
####################



Decision Tree classifier

predictions for test set:
[1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.
 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 

In [239]:
# Random Forest classifier
estimators = 10 # number of trees parameter
mss = 2 # mininum samples split parameter
print ('\n\nRandom Forest classifier\n')
clf = RandomForestClassifier(n_estimators=estimators, min_samples_split=mss) # default parameters are fine
clf.fit(X, Y)
print ("predictions for test set:")
print (clf.predict(XX))
print ('actual class values:')
print (YY)
####################



Random Forest classifier

predictions for test set:
[1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.
 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 