# Machine Learning With Big Data
## by University of California, San Diego

### Week 2

#### 1. Review the scripts:

**doweathclass.py** creates the Weather data for predicting whether someone will *Play* tennis. The data is hard-coded as a list of lists, put into a dataframe, and then mapped into an RDD of labeled point vectors. Notice that the categorical variables are recoded as binary indicator variables (e.g. outlook = 'sunny', 'overcast', or 'rainy' is replaced by three variables: sunny (1 or 0), overcast (1 or 0), rainy (1 or 0)).
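
The indicator recoding can be tried in plain Python before involving Spark. This is a minimal sketch, not the script's own code: the helper names `encode_outlook` and `encode_row` are made up for illustration, and the column order (sunny, overcast, rainy) is assumed to match the script's `out2index` dictionary.

```python
# Indicator (one-hot) encoding for the 'outlook' column, in plain Python.
# Assumed column order: sunny, overcast, rainy.
OUTLOOK_LEVELS = ['sunny', 'overcast', 'rainy']

def encode_outlook(value):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == level else 0 for level in OUTLOOK_LEVELS]

def encode_row(outlook, temp, humid, windy):
    """Build the 6-element feature list: 3 outlook indicators, temp, humidity, windy flag."""
    return encode_outlook(outlook) + [temp, humid, 1 if windy == 'TRUE' else 0]

first_row = encode_row('sunny', 85, 85, 'FALSE')
print(first_row)  # [1, 0, 0, 85, 85, 0]
```

Each original row therefore becomes six numeric features, with the *Play* value kept separately as the label.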

**doweathclass_naivebayes.py** will execute a Naive Bayes classifier on the RDD of labeled points. There are also some lines of code to get a confusion matrix.

**doweathclass_dectree.py** will execute a Decision Tree classifier on the RDD, and produce a confusion matrix as well.

#### 2. Run Naive Bayes classification with the NaiveBayes script:

You should see the confusion matrix printed out and the percent correct.

* What is the approximate percent correct? (save the answer for the quiz)
* Look at the confusion matrix and write down the numbers for the quiz.
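
To make the two quantities concrete, here is a small sketch of how a confusion matrix and the fraction correct are computed. The labels are made up for illustration and are not the script's results; `confusion_matrix` is a hypothetical helper, not an MLlib function.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes=2):
    """Rows are true classes, columns are predicted classes."""
    mat = np.zeros((num_classes, num_classes), dtype=int)
    for truth, pred in zip(true_labels, predicted_labels):
        mat[int(truth), int(pred)] += 1
    return mat

# Made-up labels for illustration only; these are NOT the script's results.
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]

toy_cf_mat = confusion_matrix(toy_true, toy_pred)
# Correct predictions sit on the diagonal, so the trace counts them.
toy_fraction_correct = np.trace(toy_cf_mat) / float(toy_cf_mat.sum())

print(toy_cf_mat)
print(toy_fraction_correct)
```

Note that the script prints the fraction correct (a value between 0 and 1); multiply by 100 if the quiz asks for a percentage.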

#### 3. Run the decision tree script:

Look at the confusion matrix that is output and save the numbers.

The decision tree function returns a decision tree object.
* Review the code.
* Check out the object.
* Enter just the object name at the prompt to display it.
* Observe how the number of nodes is related to the decision tree.
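
The node count can be checked by hand against the printed tree. The sketch below assumes the text is what `dt_model.toDebugString()` prints; it is copied from the commented output in the decision tree cell of this document. The helper `count_nodes` is hypothetical.

```python
# Debug output copied from the commented example in the decision tree cell.
debug_text = """DecisionTreeModel classifier of depth 3 with 9 nodes
  If (feature 1 <= 0.0)
   If (feature 4 <= 80.0)
    If (feature 3 <= 68.0)
     Predict: 0.0
    Else (feature 3 > 68.0)
     Predict: 1.0
   Else (feature 4 > 80.0)
    If (feature 0 <= 0.0)
     Predict: 0.0
    Else (feature 0 > 0.0)
     Predict: 0.0
  Else (feature 1 > 0.0)
   Predict: 1.0
"""

def count_nodes(text):
    """Count split nodes ('If' lines) and leaf nodes ('Predict' lines)."""
    lines = [line.strip() for line in text.splitlines()]
    splits = sum(1 for line in lines if line.startswith('If '))
    leaves = sum(1 for line in lines if line.startswith('Predict'))
    return splits, leaves

splits, leaves = count_nodes(debug_text)
print(splits)           # number of If (split) nodes
print(leaves)           # number of Predict (leaf) nodes
print(splits + leaves)  # total, which should match the header line
```

Each `If` has a matching `Else`, so only the `If` lines are counted as split nodes. On a live model, `dt_model.numNodes()` and `dt_model.depth()` should report the same figures as the header line.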

```python
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# outlook, temperature, humidity, windy, play, copied from Weka's data example
rawdata=[
  ['sunny'    ,85,85,'FALSE' ,0],
  ['sunny'    ,80,90,'TRUE'  ,0],
  ['overcast' ,83,86,'FALSE' ,1],
  ['rainy'    ,70,96,'FALSE' ,1],
  ['rainy'    ,68,80,'FALSE' ,1],
  ['rainy'    ,65,70,'TRUE'  ,0],
  ['overcast' ,64,65,'TRUE'  ,1],
  ['sunny'    ,72,95,'FALSE' ,0],
  ['sunny'    ,69,70,'FALSE' ,1],
  ['rainy'    ,75,80,'FALSE' ,1],
  ['sunny'    ,75,70,'TRUE'  ,1],
  ['overcast' ,72,90,'TRUE'  ,1],
  ['overcast' ,81,75,'FALSE' ,1],
  ['rainy'    ,71,91,'TRUE'  ,0]
]
# sqlContext is assumed to be provided by the pyspark shell or notebook
data_df=sqlContext.createDataFrame(rawdata,['outlook','temp','humid','windy','play'])

# transform categoricals into indicator variables
out2index={'sunny':[1,0,0],'overcast':[0,1,0],'rainy':[0,0,1]}

# make RDD of labeled vectors
def newrow(dfrow):
    outrow = list(out2index.get((dfrow[0])))  # get dictionary entry for outlook
    outrow.append(dfrow[1])   # temp
    outrow.append(dfrow[2])   # humidity
    if dfrow[3]=='TRUE':      # windy
        outrow.append(1)
    else:
        outrow.append(0)
    return (LabeledPoint(dfrow[4],outrow))

# go through .rdd: DataFrame has no map() on Spark 2.x and later, and .rdd also works on 1.x
datax_rdd=data_df.rdd.map(newrow)
```

```python
from pyspark.mllib.classification import NaiveBayes

# train the model; Naive Bayes needs only a single pass over the data
my_nbmodel = NaiveBayes.train(datax_rdd)

# some info on model
print(my_nbmodel)
# some checks: get some of the training data and test it
datax_col=datax_rdd.collect()   # if datax_rdd was big, use sample or take

trainset_pred =[]
for x in datax_col:
    trainset_pred.append(my_nbmodel.predict(x.features))

print(trainset_pred)

# to see the class conditionals, exponentiate the log-probabilities stored in the model
# print('Class Cond Probabilities, ie p(attr|class= 0 or 1) ')
# print(np.exp(my_nbmodel.theta))
# print(np.exp(my_nbmodel.pi))

# get a confusion matrix
# the row is the true class label 0 or 1, columns are predicted label
nb_cf_mat=np.zeros([2,2])  # num of classes
for pnt in datax_col:
    predctn = my_nbmodel.predict(np.array(pnt.features))
    # labels and predictions are floats, so cast them before using them as indices
    nb_cf_mat[int(pnt.label)][int(predctn)]+=1

corrcnt=0
for i in range(2):
    corrcnt+=nb_cf_mat[i][i]
nb_per_corr=corrcnt/nb_cf_mat.sum()
print('Naive Bayes: Conf.Mat. and Per Corr')
print(nb_cf_mat)
print(nb_per_corr)
```
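
For intuition about what `predict` does with `pi` and `theta`, here is a sketch of multinomial Naive Bayes scoring (assuming the multinomial variant, which I believe is the MLlib default). The parameter values are made up and do not come from `my_nbmodel`; `nb_predict`, `log_prior`, and `log_cond` are hypothetical names.

```python
import numpy as np

# Hypothetical log-parameters for 2 classes and 3 features (NOT from my_nbmodel).
log_prior = np.log(np.array([0.4, 0.6]))        # plays the role of the model's pi
log_cond = np.log(np.array([[0.5, 0.3, 0.2],    # plays the role of the model's theta
                            [0.2, 0.3, 0.5]]))  # one row per class

def nb_predict(features):
    """Pick the class with the largest (log prior + features . log conditionals)."""
    scores = log_prior + log_cond.dot(features)
    return int(np.argmax(scores))

print(nb_predict(np.array([3.0, 1.0, 0.0])))  # weight on feature 0 favors class 0
print(nb_predict(np.array([0.0, 1.0, 3.0])))  # weight on feature 2 favors class 1
```

Because every feature value multiplies a log-probability, features with large raw values (such as temperature and humidity here) can carry more weight in the score than the 0/1 indicators.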

```python
from pyspark.mllib.tree import DecisionTree
dt_model = DecisionTree.trainClassifier(datax_rdd,2,{},impurity='entropy',
          maxDepth=3,maxBins=32, minInstancesPerNode=2)

# the {} argument is categoricalFeaturesInfo: a dict mapping feature index -> number of categories
# (empty here, so every feature is treated as continuous)
# maxDepth limits the tree depth; maxBins is the number of bins used when searching for splits
# to do regression, drop numClasses and use the trainRegressor function
print(dt_model.toDebugString())

# results in this:
# DecisionTreeModel classifier of depth 3 with 9 nodes
#   If (feature 1 <= 0.0)
#    If (feature 4 <= 80.0)
#     If (feature 3 <= 68.0)
#      Predict: 0.0
#     Else (feature 3 > 68.0)
#      Predict: 1.0
#    Else (feature 4 > 80.0)
#     If (feature 0 <= 0.0)
#      Predict: 0.0
#     Else (feature 0 > 0.0)
#      Predict: 0.0
#   Else (feature 1 > 0.0)
#    Predict: 1.0

# notice that the node count includes both the Predict lines (leaf nodes) and the Ifs (splits)

# some checks: get some of the training data and test it
datax_col=datax_rdd.collect()   # if datax_rdd was big, use sample or take

# redo the conf. matrix code (it would be more efficient to pass a model)
dt_cf_mat=np.zeros([2,2])  # num of classes
for pnt in datax_col:
    predctn = dt_model.predict(np.array(pnt.features))
    # labels and predictions are floats, so cast them before using them as indices
    dt_cf_mat[int(pnt.label)][int(predctn)]+=1
corrcnt=0
for i in range(2):
    corrcnt+=dt_cf_mat[i][i]
dt_per_corr=corrcnt/dt_cf_mat.sum()
print('Decision Tree: Conf.Mat. and Per Corr')
print(dt_cf_mat)
print(dt_per_corr)
```

### Further actions:

#### 6. Let's try adding a useless variable. 

In the **doweathclass.py** script add a new variable that is constant, after the 'play' column, as follows (it is short enough to edit by hand):

```python
rawdata=[
  ['sunny'    ,85,85,'FALSE' ,0,1],  
  ['sunny'    ,80,90,'TRUE'  ,0,1],
  ['overcast' ,83,86,'FALSE' ,1,1],
  ['rainy'    ,70,96,'FALSE' ,1,1],
...etc
```

To process this rawdata list, the rest of the code has to change a bit. First, change the dataframe creation to include a 'mydummy' field:

```python
data_df=sqlContext.createDataFrame(rawdata, ['outlook','temp','humid','windy','play','mydummy']) # <--add field
```

Next, change the function that creates labeled points. The function needs one more line of code (see the line marked 'add this') that appends the new column to the labeled point vector.

```python
#make RDD of labeled vectors
def newrow(dfrow):    
    outrow = list(out2index.get((dfrow[0])))  #get dictionary entry
    outrow.append(dfrow[1])   #temp    
    outrow.append(dfrow[2])   #humidity    
    if dfrow[3]=='TRUE':      #windy        
        outrow.append(1)    
    else:        
        outrow.append(0)    
    outrow.append(dfrow[5])  # <---- add this     
    return (LabeledPoint(dfrow[4],outrow))
```

Now rerun the NaiveBayes and DecisionTree scripts, and observe the impact on the percent correct.
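
As a hint for the decision tree side only, the sketch below computes the entropy-based information gain of a split, the quantity a tree trained with `impurity='entropy'` tries to maximize. It is a plain-Python illustration, not MLlib's implementation; `entropy` and `info_gain` are hypothetical helpers, and the `play` and `overcast` columns are transcribed from the 14 rows of rawdata. It says nothing about how Naive Bayes responds, so rerun both scripts to see the actual effect.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log(p, 2)
    return result

def info_gain(feature_values, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # one side is empty, so nothing was actually split
    n = float(len(labels))
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# 'play' labels and two feature columns for the 14 rows of rawdata
play = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
mydummy = [1] * 14                                    # the constant column
overcast = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]  # overcast indicator

print(info_gain(mydummy, play, 0.5))   # a constant column cannot separate the rows
print(info_gain(overcast, play, 0.5))  # an informative column gives a positive gain
```

No threshold can put rows on both sides of a constant column, so its gain is zero at every candidate split.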

#### 7. Let's add a new test point.

Create a NumPy array with the following values:

```python
# outlook binary indicator variables: sunny = 1, overcast = 0, rainy = 0
# temperature = 68
# humidity = 79
# windy = 0
# useless constant dummy variable = 1
newpoint = np.array([1, 0, 0, 68, 79, 0, 1])
```

Now run that test point through the NaiveBayes and DecisionTree models using their predict functions.

```python
my_nbmodel.predict( ....)
dt_model.predict( ... )
```

You should see that NaiveBayes and the Decision Tree give different answers. It is very difficult to say exactly why the answers differ without recreating each algorithm's entire calculation. However, we can do a quick analysis as follows:

Print the decision tree out and look at how the test point is labeled in that tree. For example, observe which variables are used to get the prediction node for this test point.
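
The walk through the tree can also be sketched in code. The tree below is a hand transcription of the debug output printed earlier for the six-feature model; a tree retrained with the dummy column may differ, so treat this as an illustration rather than your model's actual result. The tuple layout and the `trace` helper are hypothetical.

```python
import numpy as np

# Hand transcription of the tree in the earlier debug output (6-feature model).
# A split is (feature_index, threshold, subtree_if_le, subtree_if_gt); a leaf is a float.
printed_tree = (1, 0.0,
                (4, 80.0,
                 (3, 68.0, 0.0, 1.0),
                 (0, 0.0, 0.0, 0.0)),
                1.0)

def trace(tree, point):
    """Walk the tree and return (prediction, list of feature indices tested)."""
    path = []
    node = tree
    while isinstance(node, tuple):
        feature, threshold, if_le, if_gt = node
        path.append(feature)
        node = if_le if point[feature] <= threshold else if_gt
    return node, path

newpoint = np.array([1, 0, 0, 68, 79, 0, 1])
prediction, features_used = trace(printed_tree, newpoint)
print(prediction)     # 0.0
print(features_used)  # [1, 4, 3], i.e. overcast, humidity, temperature
```

In this transcribed tree the test point sits right at two thresholds: humidity 79 is just under 80, and temperature 68 is exactly on the 68 boundary. A small change in either value would send it to a different leaf.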