<a href="https://colab.research.google.com/github/ralsouza/apache_spark_real_time_analytics/blob/master/notebooks/09_pyspark_mllib_random_forest_with_dimensionality_reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark MLlib - Classification - Random Forest
Description:
*   One of the most popular classification algorithms;
*   It is an ensemble method;
*   Random Forest trains many decision trees, and each tree predicts the outcome individually. The forest then combines these predictions, typically by majority vote, to produce the final class;
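
The voting step described above can be sketched in plain Python. This is a simplified illustration of hard majority voting, not Spark's implementation, which may aggregate the tree outputs differently. The function `majority_vote` and the sample votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single customer
votes = ["yes", "no", "yes", "yes", "no"]
prediction = majority_vote(votes)
print(prediction)
```

Three of the five trees vote "yes", so that class becomes the forest's prediction for this customer.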

Advantages:
*   Usually offers strong predictive performance
*   Efficient with many predictor variables
*   Parallelizes well
*   Copes well with missing values

Disadvantages:
* Slower than a single decision tree
* Bias can occur frequently

Applications:
* Scientific research;
* Medical diagnosis;





# Setup

In [None]:
!apt-get update

In [1]:
# Install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [2]:
# Environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [3]:
# Make pyspark importable; pass the absolute SPARK_HOME path so this
# does not depend on the current working directory
import findspark
findspark.init(os.environ["SPARK_HOME"])

In [4]:
# Libraries and Context Setup
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

In [5]:
# Spark configuration: serve the Spark UI on port 4050
conf = SparkConf().set("spark.ui.port", "4050")

# Create the Spark context
sc = pyspark.SparkContext(conf=conf)

# Create the Spark session (reuses the context created above)
spark = SparkSession.builder.master('local').appName('spark_ml_lib').getOrCreate()

# Create the SQL context
sqlContext = pyspark.SQLContext(sc)

# Business Problem
### Classify customers according to whether they are likely to repay their credit.

# Libraries

In [6]:
import math
from pyspark.ml.linalg         import Vectors
from pyspark.sql               import Row
from pyspark.ml.feature        import StringIndexer
from pyspark.ml.feature        import PCA
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation     import MulticlassClassificationEvaluator

In [9]:
# Get a SparkSession to work with DataFrames
# (getOrCreate returns the already active session if one exists)
sp_session = SparkSession.builder.master('local').appName('spark_mllib_app').getOrCreate()

In [10]:
rdd_bank = sc.textFile('/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/mllib/bank.csv')

In [11]:
rdd_bank.cache()

/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/mllib/bank.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [13]:
rdd_bank.count()

542

In [14]:
rdd_bank.take(5)

['"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
 '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"',
 '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"',
 '35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"yes"',
 '30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"yes"']
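
The rows above are semicolon-delimited, with string fields wrapped in double quotes. As a minimal sketch, one such line could be split into fields with the standard `csv` module. The helper `parse_line` is hypothetical, and the notebook's own cleansing steps may take a different approach.

```python
import csv


def parse_line(line):
    """Split one semicolon-delimited line into a list of string fields,
    stripping the surrounding double quotes."""
    return next(csv.reader([line], delimiter=";", quotechar='"'))


# Header and first data row, copied from the output above
header = ('"age";"job";"marital";"education";"default";"balance";"housing";'
          '"loan";"contact";"day";"month";"duration";"campaign";"pdays";'
          '"previous";"poutcome";"y"')
row = ('30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";'
       '19;"oct";79;1;-1;0;"unknown";"no"')

columns = parse_line(header)
fields = parse_line(row)

# Pair each column name with its value; values are still strings here
record = dict(zip(columns, fields))
print(len(columns), record["job"], record["balance"], record["y"])
```

Every value comes back as a string, so numeric columns such as `age` or `balance` would still need casting. On the RDD, a function like this could be applied with `rdd_bank.map(parse_line)` after filtering out the header line.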

# Data Cleansing