# Predict Using Random Forest Classifier

**Libraries we are going to use:** the *scikit-learn* machine learning library, together with *pandas* and *NumPy* for data handling.

We use the scikit-learn Random Forest Classifier to predict whether a particular student has completed the **test preparation course**, given the student's *math*, *reading*, and *writing scores*.


**Import required libraries**

In [None]:
# data handling
import pandas as pd # pandas
import numpy as np # numpy

# preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

# classifier
from sklearn.ensemble import RandomForestClassifier

# other
from sklearn.model_selection import train_test_split

**Read the data from csv file**

<code>read_csv()</code> reads a CSV (comma-separated values) file into a DataFrame. It has several useful parameters.
<code>filepath_or_buffer</code> is the path to the CSV file. <code>header</code> is the number of the row that holds the column names.

<p>Here <code>filepath_or_buffer</code> is <b>../input/StudentsPerformance.csv</b>, written inside double quotation marks. <code>header</code> is set to <b>0</b>, so the first row of the file supplies the column names.</p>

In [None]:
# read csv
data_frame = pd.read_csv(
    filepath_or_buffer = "../input/StudentsPerformance.csv", # file path of csv
    header = 0, # header row
)

**Read top few rows from the file**

<code>head()</code> returns the first rows of the data frame. The argument is **10** here because we want to read the top 10 rows.

In [None]:
data_frame.head(10) # top 10 rows from csv
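
To make the roles of <code>header</code> and <code>head()</code> concrete, here is a minimal sketch on an in-memory CSV. The three-row sample and the names `csv_text`, `sample`, and `first_two` are made up for illustration; they are not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real CSV has more columns and rows.
csv_text = (
    "math score,reading score,writing score\n"
    "72,72,74\n"
    "69,90,88\n"
    "90,95,93\n"
)

# header=0: the first line of the file becomes the column names
sample = pd.read_csv(io.StringIO(csv_text), header=0)

# head(2): keep only the first two data rows
first_two = sample.head(2)

print(list(sample.columns))
print(first_two)
```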

**Check missing values**

<code>isnull()</code> and <code>sum()</code> are used together to count the missing values in each column of the data frame.

In [None]:
data_frame.isnull().sum() # checking missing values
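
As a small illustration of what this check reports, the sketch below runs the same call on a toy frame with one missing value. The frame `toy` is an assumption made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing math score (illustrative data only)
toy = pd.DataFrame({
    "math score": [72, np.nan, 90],
    "reading score": [72, 90, 95],
})

# isnull() marks each cell True/False; sum() counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of 0 has no missing values.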

**Creating a new data frame for data points**

<code>usecols</code> makes <code>read_csv()</code> return a **subset of the columns**. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in <code>names</code> or inferred from the document header row(s). For example, a valid list-like <code>usecols</code> parameter would be <code>[5, 6, 7]</code> or <code>['math score', 'reading score', 'writing score']</code>.

In [None]:
# get writing, reading, and math scores for a separate data frame
ML_DataPoints = pd.read_csv(
    filepath_or_buffer = "../input/StudentsPerformance.csv", # file path of csv
    header = 0, # header row
    usecols = ['math score',
               'reading score',
               'writing score'] # data points columns
)

**Creating another data frame for labels**

In [None]:
# get test preparation course values
ML_Labels = pd.read_csv(
    filepath_or_buffer = "../input/StudentsPerformance.csv", # file path of csv
    header = 0, # header row
    usecols = ['test preparation course'] # data points labels
)
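
The two cells above read the CSV file again for each subset. One possible alternative, sketched here on a made-up two-row frame, is to select the columns from a frame that is already loaded. The names `frame`, `features`, and `labels` are hypothetical, and this is not what the notebook itself does.

```python
import pandas as pd

# Made-up stand-in for a frame that has already been read from the CSV
frame = pd.DataFrame({
    "test preparation course": ["none", "completed"],
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

# A list of column names returns a DataFrame with just those columns
features = frame[["math score", "reading score", "writing score"]]

# A single column name returns a one-dimensional Series
labels = frame["test preparation course"]

print(features.shape, labels.tolist())
```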

**Load MinMaxScaler**

Transforms features by *scaling each feature to a given range*.

<p>This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.</p>

In [None]:
# min max scaler
MNScaler = MinMaxScaler()
MNScaler.fit(ML_DataPoints) # fit math, reading, and writing scores
T_DataPoints = MNScaler.transform(ML_DataPoints) # transform the scores
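
With the default range of zero to one, the scaling amounts to `(x - min) / (max - min)` computed per column. The sketch below checks this on a small made-up array; `scores`, `scaled`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up scores: two features, three students
scores = np.array([[40.0, 50.0],
                   [70.0, 75.0],
                   [100.0, 100.0]])

# Scaling done by scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(scores)

# The same scaling written out by hand, column by column
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

print(scaled)
```

Each column now runs from 0 (its smallest value) to 1 (its largest value).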

**Load LabelEncoder**

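Encode the labels in the column *test preparation course* as the integers 0 and 1. <code>LabelEncoder</code> assigns the codes in sorted order of the class names, so *completed* becomes 0 and *none* becomes 1. We are using RandomForestClassifier as a binary classifier because there are only two known classes in test preparation course: *none* and *completed*.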

In [None]:
# label encoder
LEncoder = LabelEncoder()
LEncoder.fit(ML_Labels.values.ravel()) # fit labels (flattened to 1-D, as LabelEncoder expects)
T_Labels = LEncoder.transform(ML_Labels.values.ravel()) # transform the labels
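
To see which class ends up as 0 and which as 1, the encoder's <code>classes_</code> attribute can be inspected. This is a minimal sketch on a hand-written label array; `raw`, `enc`, `codes`, and `decoded` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hand-written labels using the two classes from the data set
raw = np.array(["none", "completed", "none", "completed"])

enc = LabelEncoder()
codes = enc.fit_transform(raw)

# classes_ is sorted, and the position of a class in it is its integer code
print(enc.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings
decoded = enc.inverse_transform(codes)
print(decoded)
```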

**Split train test data set**

Split arrays or matrices into random *train* and *test* subsets.

In [None]:
# split train test data set
XTrain, XTest, YTrain, YTest = train_test_split(T_DataPoints, T_Labels, random_state = 10)
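
When no size is given, <code>train_test_split</code> holds out a quarter of the rows for testing, and a fixed <code>random_state</code> makes the shuffle repeatable. The sketch below shows both on made-up arrays; `X`, `y`, and the split names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with 2 features each, and alternating labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Default split: 75% of the rows for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Same random_state, same split
X_train_again, _, _, _ = train_test_split(X, y, random_state=10)

print(X_train.shape, X_test.shape)
```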

## Random Forest Classifier


<p>A random forest is <i>a meta estimator</i> that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.</p>

<code>n_estimators</code> <i>integer, optional (default=10 in older scikit-learn releases, 100 since version 0.22)</i> The number of trees in the forest.<br/>
<code>random_state</code> <i>int, RandomState instance or None, optional (default=None)</i> If int, random_state is the seed used by the random number generator.<br/>


In [None]:
RandomForest = RandomForestClassifier(
    n_estimators = 10,
    random_state = 3
) # load the classifier

**Fit XTrain and YTrain**

<code>fit()</code> builds a forest of trees from the training set (X, y).

In [None]:
RandomForest.fit(XTrain, YTrain) # fit data points and labels

<code>score()</code> returns the mean accuracy on the given data and labels. We call it on the training set first and then on the test set.

In [None]:
RandomForest.score(XTrain, YTrain)

In [None]:
RandomForest.score(XTest, YTest)
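
Comparing the two scores is a quick check for overfitting: a training accuracy far above the test accuracy suggests the forest has memorised the training rows. The sketch below shows the effect on synthetic data whose labels are random on purpose, so there is nothing real to learn. The data and all variable names are assumptions for illustration, not the student data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 3)             # synthetic features in [0, 1]
y = rng.randint(0, 2, size=300)  # synthetic labels, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

model = RandomForestClassifier(n_estimators=10, random_state=3)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows the forest has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out rows

# score() is the accuracy of predict() against the true labels
manual_train_acc = accuracy_score(y_train, model.predict(X_train))

print(train_acc, test_acc)
```

Because the labels carry no signal, the training accuracy here reflects memorisation only, while the test accuracy stays close to chance.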

Here we create a custom <code>numpy array</code> to test the model. The data points are taken from the original data set, and their labels are <code>array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])</code>.

In [None]:
data_points = np.array([
    [72, 72, 74], [90, 95, 93], [47, 57, 44], [76, 78, 75], [71, 83, 78], # none --> 1
    [69, 90, 88], [88, 95, 92], [64, 64, 67], [78, 72, 70], [46, 42, 46] # completed --> 0
])

Preprocess the numpy array.

In [None]:
T_Points = MNScaler.transform(data_points)

**Predict the transformed data points**

In [None]:
RandomForest.predict(T_Points)

The predicted array is <code>array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])</code>. The model misclassifies only one of the ten data points.
<p>This model is <code>overfitted</code>, as you can see by comparing its scores on the training and test data sets.</p>
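
The comparison between the expected labels and the predicted labels can also be done in code. The sketch below copies both arrays from the text rather than re-running the model; the `names` lookup assumes the encoding noted in the cell comments (completed is 0, none is 1).

```python
import numpy as np

expected = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # labels stated in the text
predicted = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])  # output reported in the text

# Fraction of the ten points that were predicted correctly
accuracy = (expected == predicted).mean()

# Positions where the prediction differs from the expected label
wrong = np.flatnonzero(expected != predicted)

# Map the integer codes back to class names (assumed: 0 = completed, 1 = none)
names = np.array(["completed", "none"])[predicted]

print(accuracy, wrong, names[wrong])
```

The single mismatch is at position 6, the point `[88, 95, 92]`, which is labelled *completed* but predicted as *none*.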