# <a name="gotop"></a>Table of Contents
[Points to Remember](#points)

[Numpy](#startingnumpy)
   + [Advantages of Numpy](#numpyadvantages)
   + [Creating Array](#array)
   + [Converting list to array using numpy](#listtoarray)
   + [Array Indexing](#arrayindexing)
   + [Array Slicing](#arrayslicing)
   + [Array Concatenation](#arrayconcatenation)

[Pandas](#pandas)

[Exploring ML Dataset](#exploring)

[Building a ML Model using Random Forest](#building)
   + [Building the model](#buildingthemodel)
   + [Predicting the accuracy](#prediction)

# Introduction to Numpy and Pandas

The pandas library has emerged as a powerhouse for data manipulation tasks in Python since its development began in 2008. With its intuitive syntax and flexible data structures, it's easy to learn and enables faster data computation. The development of the numpy and pandas libraries has extended Python's multi-purpose nature to solving machine learning problems as well. The acceptance of the Python language in machine learning has been phenomenal since then.

In this tutorial, we'll learn about using numpy and pandas **libraries for data manipulation** from scratch. Instead of going into theory, we'll take a practical approach. First, we'll understand the syntax and commonly used functions of the respective libraries. Later, we'll work on a real-life data set.

**Note**: This tutorial is best suited for people who know the basics of python. No further knowledge is expected. 

## <a name="points"></a>Points to be Remembered

+ The data manipulation capabilities of pandas are built on top of the numpy library. In a way, numpy is a dependency of the pandas library.

+ Pandas is best at handling tabular data sets comprising different variable types (integer, float, double, etc.). In addition, the pandas library can be used for tasks ranging from the most basic, such as loading data, to feature engineering on time series data.

+ Numpy is most suitable for performing basic numerical computations such as mean, median, range, etc. It also supports the creation of multi-dimensional arrays.

+ Numpy library can also be used to integrate C/C++ and Fortran code.

+ Remember, Python uses zero-based indexing, unlike R, where indexing starts at one.

+ The best part of learning pandas and numpy is the strong active community support you'll get from around the world.

## Environment to be used
At present, we can use [Google Colaboratory](https://colab.research.google.com) for trying out what we are learning. Its advantages:
+ No installation needed
+ Easily accessible online
+ Pre-installed packages

**Note:** If you plan to upload a data set to Colab, errors might occur on Chrome; try using Firefox if any error occurs.



# <a name="startingnumpy"></a>Starting off with Numpy


**Numpy** is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy was Numeric. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors. 

**<a name="numpyadvantages"></a>Advantages of Numpy**

It contains among other things:

+ a powerful N-dimensional array object
+ sophisticated (broadcasting) functions
+ tools for integrating C/C++ and Fortran code
+ useful linear algebra, Fourier transform, and random number capabilities
+ Size - Numpy data structures take up less space than Python lists
+ Performance - Numpy arrays are faster than lists
+ Functionality - SciPy and NumPy have optimized functions, such as linear algebra operations, built in.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.




In [2]:
#After opening a new project on Google colab, let's first import Numpy. 

#load the library and check its version, just to make sure we aren't using an older version
import numpy as np
np.__version__
#the output of the above command shows the version of numpy available on Colab

'1.15.2'

Next, we can create a list of numbers ranging from 0 to 50, to play with.

In [None]:
# we are using the list and range function to create the list of numbers
# Create a list of numbers from 0 to 50.

#The above two lines create and print a list of numbers ranging from 0 to 50
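
One possible way to fill in the cell above is sketched below. The variable name `nums` is an assumption (the cell does not name the list), and "0 to 50" is read here as including 50.

```python
# Hypothetical name `nums`; range(51) yields 0, 1, ..., 50.
nums = list(range(51))
print(nums)
```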

Next, we can convert the integers to strings. This style of building one list from another in a single expression is known as **list comprehension.**
List comprehension offers a versatile way to handle list manipulation tasks easily. We'll learn about them in other examples. Here's an example.

In [None]:
# Use list comprehension to create a list by converting the list values to string.
str_nums = 
# Use list comprehension to create a list containing the type of each element of the list.
type_list = 
print("String List: ", str_nums[:5], "Type List: ", type_list[:5])
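
A minimal sketch of how the two blanks above could be completed. It assumes the source list is called `nums` (a hypothetical name) and that `type_list` is meant to hold the types of the converted elements.

```python
# Hypothetical source list; the exercise does not name it.
nums = list(range(51))

# Convert every integer to its string form.
str_nums = [str(n) for n in nums]

# Record the type of each converted element.
type_list = [type(s) for s in str_nums]

print("String List: ", str_nums[:5], "Type List: ", type_list[:5])
```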

### <a name="array"></a>Creating Arrays
Numpy arrays are homogeneous in nature, i.e., they comprise one data type (integer, float, double, etc.) unlike lists.
The following examples show how to create:
+ array with predefined values
+ array with a set sequence
+ array of evenly spaced values within a given range
+ identity matrix

It's pretty simple. Take a look at the example below: 

In [None]:
# We can create an array of zeroes using numpy directly.
zeros_array = 
print("Array with all zeros: \n\n", zeros_array, "\n")


# Here we are creating a 3 row x 5 column matrix
ones_array = 
print("Array with all ones: \n\n", ones_array, "\n")


In [None]:
# creating a matrix with a predefined value
using_full = 
print(using_full)

In [None]:

#create an array with a set sequence
using_arrange = 
print(using_arrange)


In [None]:
#create an array of evenly spaced values within a given range
using_linspace = 
print(using_linspace)

In [None]:
#create a 3x3 array of random values drawn from a normal distribution with mean 0 and standard deviation 1
a = 
print(a)

In [None]:
#create an identity matrix
id_matrix = 
print(id_matrix)
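
The array-creation cells above can be completed in several ways. The sketch below shows one option per cell; every shape, fill value, and range is an assumption chosen for illustration, except the 3 x 5 shape of `ones_array`, which the cell's own comment states.

```python
import numpy as np

zeros_array = np.zeros(10, dtype=int)       # ten zeros (length is an assumption)
ones_array = np.ones((3, 5), dtype=float)   # 3 rows x 5 columns of ones
using_full = np.full((3, 5), 1.23)          # every entry set to 1.23 (assumed value)
using_arrange = np.arange(0, 20, 2)         # 0, 2, ..., 18 (assumed range and step)
using_linspace = np.linspace(0, 1, 5)       # 5 evenly spaced values from 0 to 1
a = np.random.normal(0, 1, (3, 3))          # 3x3 draws with mean 0, std 1
id_matrix = np.eye(3)                       # 3x3 identity matrix

print(zeros_array, ones_array, using_full, sep="\n\n")
print(using_arrange, using_linspace, a, id_matrix, sep="\n\n")
```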

In [None]:
#setting a random seed
np.random.seed(0)


x1 =  #one dimension
x2 =  #two dimension
x3 =  #three dimension

print(x1, x2, x3)

# Printing the various attributes
print("x3 ndim:", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
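
One way to fill in `x1`, `x2` and `x3` is with `np.random.randint`. The upper bound of 10 and the shapes below are assumptions; any one-, two- and three-dimensional shapes would serve.

```python
import numpy as np

np.random.seed(0)  # makes the random draws repeatable

x1 = np.random.randint(10, size=6)          # one dimension
x2 = np.random.randint(10, size=(3, 4))     # two dimensions
x3 = np.random.randint(10, size=(3, 4, 5))  # three dimensions

print("x3 ndim:", x3.ndim)    # number of dimensions
print("x3 shape:", x3.shape)  # length along each dimension
print("x3 size: ", x3.size)   # total number of elements (3 * 4 * 5)
```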

### <a name="listtoarray"></a>Converting lists to array using Numpy

One of the main advantages of Numpy is that lists can be easily converted to arrays.

In the code below, we convert the numbers list to an array using Numpy.


In [None]:
numbers = [10, 20, 30, 40, 50, 60]
numbers_array = # Create the array from list.
numbers_array
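
A minimal sketch of the conversion, assuming `np.array` is the intended call:

```python
import numpy as np

numbers = [10, 20, 30, 40, 50, 60]
numbers_array = np.array(numbers)  # builds a 1-D integer array from the list
print(numbers_array, type(numbers_array))
```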

Next, we can move on to Array Indexing

## <a name="arrayindexing"></a>Array Indexing

The important thing to remember is that indexing in python starts at zero.



In [None]:
x1 = np.array([4, 3, 4, 4, 8, 4])
x1

In [None]:
#access the value at index zero


In [None]:
#access the fifth value


In [None]:
#get the last value


In [None]:
#get the second last value
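
The four lookups above could be written as follows. The result names are hypothetical, and "fifth value" is read as the element at index 4.

```python
import numpy as np

x1 = np.array([4, 3, 4, 4, 8, 4])

first = x1[0]         # value at index zero -> 4
fifth = x1[4]         # fifth value (index 4) -> 8
last = x1[-1]         # negative indices count from the end -> 4
second_last = x1[-2]  # -> 8

print(first, fifth, last, second_last)
```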


In [None]:
##in a multidimensional array, we need to specify row and column index
x2 = np.array([[3, 7, 5, 5],
      [0, 1, 5, 9],
      [3, 0, 5, 0]])

In [None]:
##1st row and 2nd column value


In [None]:
#3rd row, last column value


In [None]:
#replace value at 0,0 index

x2
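
A sketch of the two-dimensional lookups, reading "1st row and 2nd column" as one-based positions (so index `[0, 1]`). The replacement value 12 is an arbitrary assumption.

```python
import numpy as np

x2 = np.array([[3, 7, 5, 5],
               [0, 1, 5, 9],
               [3, 0, 5, 0]])

first_row_second_col = x2[0, 1]  # row index 0, column index 1 -> 7
third_row_last = x2[2, -1]       # row index 2, last column -> 0

x2[0, 0] = 12                    # overwrite the value at position (0, 0)
print(first_row_second_col, third_row_last)
print(x2)
```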

## <a name="arrayslicing"></a>Array Slicing
Now, we'll learn to access multiple or a range of elements from an array. Contents of ndarray object can be accessed and modified by indexing or slicing, just like Python's in-built container objects.

As mentioned earlier, items in an ndarray object follow a zero-based index. Three types of indexing methods are available: field access, basic slicing and advanced indexing.

Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is passed to the array to extract a part of array.

In [None]:
#first let's create an array in the range of 10
x = np.arange(10)
x

In [None]:
#from start to 4th position


In [None]:
#from 4th position to end


In [None]:
#from 4th to 6th position


In [None]:
#return elements at even place


In [None]:
#return elements from first position step by two


In [None]:
#reverse the array
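
The slicing cells above might be filled in as sketched here. "Position" is read as a zero-based index with the named end position included, which is an assumption; the result names are hypothetical.

```python
import numpy as np

x = np.arange(10)

start_to_4 = x[:5]     # indices 0..4
from_4_to_end = x[4:]  # indices 4..9
from_4_to_6 = x[4:7]   # indices 4..6
even_places = x[::2]   # every second element, starting at index 0
odd_places = x[1::2]   # every second element, starting at index 1
reversed_x = x[::-1]   # a negative step walks the array backwards

print(start_to_4, from_4_to_end, from_4_to_6)
print(even_places, odd_places, reversed_x)
```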


## <a name="arrayconcatenation"></a>Array Concatenation
Often, you may have two or more NumPy arrays and want to concatenate (join/merge) them into a single array. Instead of typing each of their elements manually, you can use array concatenation to handle such tasks easily. NumPy offers multiple options to join arrays.

A common case: given two 2-D arrays, how can we concatenate them row-wise or column-wise? NumPy's concatenate function allows you to concatenate arrays either by rows or by columns. Let us see a couple of examples of NumPy's concatenate function.

In [None]:
#let's start by creating 2 simple arrays
#You can concatenate two or more arrays at once.
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [21,21,21]
concatenated_arr = # Concatenate

In [None]:
#You can also use this function to create 2-dimensional arrays.
grid = np.array([[1,2,3],[4,5,6]])
## Your code here

In [None]:
#Use its axis parameter to define row-wise or column-wise matrix
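
A sketch of the three concatenation cells using `np.concatenate`. The names `stacked_rows` and `stacked_cols` are hypothetical.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [21, 21, 21]  # a plain list is accepted alongside arrays

# Join several 1-D sequences end to end.
concatenated_arr = np.concatenate([x, y, z])

grid = np.array([[1, 2, 3], [4, 5, 6]])

# Default axis=0 stacks row-wise: the result has 4 rows and 3 columns.
stacked_rows = np.concatenate([grid, grid])

# axis=1 stacks column-wise: the result has 2 rows and 6 columns.
stacked_cols = np.concatenate([grid, grid], axis=1)

print(concatenated_arr)
print(stacked_rows)
print(stacked_cols)
```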


#### Also, we can split the arrays based on pre-defined positions. Let's see how!

In [None]:
#creating an array in the range of 10
x = np.arange(10)
x

In [None]:
#Now, let's split this array at 3rd and 6th positions

print(x1,x2,x3)
#the output will have 3 arrays, having 3 elements in the first two and 4 elements in the third array
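
The split described above can be sketched with `np.split`, which takes the indices at which to cut:

```python
import numpy as np

x = np.arange(10)

# Cut before index 3 and before index 6, giving three pieces.
x1, x2, x3 = np.split(x, [3, 6])
print(x1, x2, x3)
```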

## <a name="arrayconcatenation"></a> Mathematical Functions
In addition to the functions we learned above, there are several other mathematical functions available in the numpy library such as sum, divide, multiply, abs, power, mod, sin, cos, tan, log, var, min, mean, max, etc. which can be used to perform basic arithmetic calculations.

In [None]:
x

##### Finding the sum of two arrays.
Numpy arrays can be added element by element: each number is added to the number at the same position in the other array. I.e., instead of looping over the length of two arrays and finding the sum, we can add them up as regular numbers.

In [None]:
# Add the two numpy arrays together.
x+x

##### Power of Vectorization
Vectorization is a very powerful concept where an operation is applied to a whole array at once in optimized, compiled code instead of an explicit Python loop, thereby increasing speed. Let us now compare how this will give us speed improvements compared to a normal list by comparing the time taken to double the elements of a list with the time taken to double the values of a numpy array.

In [6]:
import time

In [7]:
# Create a list with elements from 0 to 99999
a = [i for i in range(0, 100000)]
# Clock the start time
start_time = time.time()
# Start the loop for the operation
for i in range(0, 100000):
    a[i] = a[i]*2
print("Total time taken = ", time.time()-start_time)

Total time taken =  0.02624225616455078


Now compare the same for a numpy array

In [8]:
# Create a numpy array with elements from 0 to 99999
a = np.arange(0, 100000)
# Clock the start time
start_time = time.time()
# Double every element in a single vectorized operation
a = a*2
print("Total time taken = ", time.time()-start_time)

Total time taken =  0.0004336833953857422


As we can see, the improvement in speed is immense: about 60 times faster in this run. This helps us a lot when we are processing large amounts of data.

Apart from addition and multiplication, we can perform all the standard arithmetic operations in this manner. 

##### Other operations on a numpy array
A numpy array has built in functions for computing mean, median, sum, etc. They can be done as follows.

In [None]:
# New numpy array
np_arr = np.arange(0, 10)

In [None]:
# Calculate the various attributes of the array.
mean = 
std = 
arr_sum = 
print(mean, std, arr_sum)
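
A minimal sketch of the three statistics using the array's own methods (`np.mean(np_arr)` and similar function forms would work equally well):

```python
import numpy as np

np_arr = np.arange(0, 10)

mean = np_arr.mean()    # 4.5
std = np_arr.std()      # population standard deviation (ddof=0)
arr_sum = np_arr.sum()  # 45
print(mean, std, arr_sum)
```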

Let's move on to pandas now. Make sure you follow each line below because it'll help you in doing data manipulation using pandas.

# <a name="pandas"></a>Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

**Library features**

+ DataFrame object for data manipulation with integrated indexing.

+ Tools for reading and writing data between in-memory data structures and different file formats.

+ Data alignment and integrated handling of missing data.

+ Reshaping and pivoting of data sets.

+ Label-based slicing, fancy indexing, and subsetting of large data sets.

+ Data structure column insertion and deletion.

+ Group by engine allowing split-apply-combine operations on data sets.

+ Data set merging and joining.

+ Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
 
+ Time series-functionality: Date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.

+ Provides data filtration.

The library is highly optimized for performance, with critical code paths written in Cython or C.

In [None]:
# First, we can import Pandas
import pandas as pd

#here, pd is just an alias for the name pandas (short hand)

In [None]:
# Now we can create a data frame - dictionary is used here where keys get converted to column names and values to row values.
data = pd.DataFrame({'Country': ['Russia','Colombia','Chile','Equador','Nigeria'],
                    'Rank':[121,40,100,130,11]})
data

In [None]:
#We can do a quick analysis of any data set using describe():


Remember, the describe() method computes summary statistics of numeric (integer / double) variables. To get the complete information about the data set, we can use the info() function.

In [None]:
#Among other things, it shows the data set has 5 rows and 2 columns with their respective names.
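
The two empty cells above can be sketched as follows, using the country/rank frame from earlier. The name `summary` is hypothetical.

```python
import pandas as pd

data = pd.DataFrame({'Country': ['Russia', 'Colombia', 'Chile', 'Equador', 'Nigeria'],
                     'Rank': [121, 40, 100, 130, 11]})

# describe() summarises numeric columns only, so only 'Rank' appears.
summary = data.describe()
print(summary)

# info() prints the row count, column names, dtypes and non-null counts.
data.info()
```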

In [None]:
#Let's create another data frame.
data = pd.DataFrame({'group':['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
#Let's sort the data frame by ounces - inplace = True will make changes to the data


### <a name="sortingpandas"></a>Sorting the data
We can sort the data by not just one column but multiple columns as well.

In [None]:
# Sort the data using the column names group and ounces in ascending and descending respectively:
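
Both sorting cells can be sketched with `sort_values`. This version returns new frames under hypothetical names rather than passing `inplace=True`, so the original `data` stays untouched for comparison.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# Sort by a single column (ascending by default).
by_ounces = data.sort_values(by='ounces')

# Sort by group ascending, then by ounces descending within each group.
by_group_then_ounces = data.sort_values(by=['group', 'ounces'],
                                        ascending=[True, False])

print(by_ounces)
print(by_group_then_ounces)
```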


Often, we get data sets with duplicate rows, which is nothing but noise. Therefore, before training the model, we need to make sure we get rid of such inconsistencies in the data set. Let's see how we can remove duplicate rows.

In [None]:
#create another data with duplicated rows
data = 
data

In [None]:
#sort values with k2


In [None]:
#remove duplicates - ta da! 


Here, we removed duplicates based on matching row values across all columns. Alternatively, we can also remove duplicates based on a particular column. Let's remove duplicate values from the k1 column.

In [None]:
# Code here
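
A sketch of the duplicate-removal steps above. The sample frame is an assumption: the tutorial only tells us it has columns `k1` and `k2` and contains repeated rows.

```python
import pandas as pd

# Hypothetical data with repeated rows.
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 4, 4]})

sorted_data = data.sort_values(by='k2')         # sort values by k2
deduped = data.drop_duplicates()                # rows identical in all columns
deduped_k1 = data.drop_duplicates(subset='k1')  # keep first row per k1 value

print(sorted_data)
print(deduped)
print(deduped_k1)
```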


Now, we will learn to categorize rows based on predefined criteria. It happens a lot during data processing that you need to categorize a variable. For example, say we have got a column with country names and we want to create a new variable 'continent' based on these country names. In such situations, we will require the steps below:

In [None]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Now, we want to create a new variable which indicates the type of animal which acts as the source of the food. To do that, first we'll create a dictionary to map the food to the animals. Then, we'll use the map function to map each food (key) to its animal (value). Let's see how it is done.

In [None]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

def meat_2_animal(series):
    if series['food'] == 'bacon':
        return 'pig'
    elif series['food'] == 'pulled pork':
        return 'pig'
    elif series['food'] == 'pastrami':
        return 'cow'
    elif series['food'] == 'corned beef':
        return 'cow'
    elif series['food'] == 'honey ham':
        return 'pig'
    else:
        return 'salmon'

In [None]:
#create a new variable
data['animal'] = # Use map to change the values.
data
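
One way to complete the `map` cell is sketched below. Lower-casing first is an assumption made so that 'Pastrami' and 'Bacon' match the lower-case dictionary keys; without it those rows would come back as NaN.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef',
                              'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon',
}

# Normalise case, then look each food up in the dictionary.
data['animal'] = data['food'].str.lower().map(meat_to_animal)
print(data)
```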

In [None]:
#another way of doing it is: convert the food values to the lower case and apply the function
lower = lambda x: x.lower()
data['food'] = data['food'].apply(lower)
data['animal2'] = data.apply(meat_2_animal, axis='columns')
data

Another way to create a new variable is by using the assign function. With this tutorial, as you keep discovering the new functions, you'll realize how powerful pandas is.

In [None]:
data.assign(new_variable = data['ounces']*10)

In [None]:
#Let's remove the column animal2 from our data frame.

data.drop('animal2',axis='columns',inplace=True)
data

We frequently find missing values in our data set. Often they are recorded as placeholder values such as -999; a quick first step is to replace such placeholders with NaN so that pandas treats them as missing. Not just missing values, you may find lots of outliers in your data set, which might require replacing. Let's see how we can replace values.



In [None]:
#The Series function from pandas creates a one-dimensional labeled array
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

In [None]:
#replace -999 with NaN values
data.replace(-999, np.nan,inplace=True)
data

In [None]:
#We can also replace multiple values at once.
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace([-999,-1000],np.nan,inplace=True)
data

**Now, let's learn how to rename column names and axis (row names).**

In [None]:

data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])
data

In [None]:
#Using rename function
data.rename(index = {'Ohio':'SanF'}, columns={'one':'one_p','two':'two_p'},inplace=True)
data

In [None]:
#You can also use string functions
data.rename(index = str.upper, columns=str.title,inplace=True)
data

In [None]:
#Next, we'll learn to categorize (bin) continuous variables.

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [None]:
# We'll divide the ages into bins such as 18-25, 26-35,36-60 and 60 and above.

#Understand the output - '(' means the value is excluded from the bin, ']' means the value is included
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

In [None]:
#By default the right edge of each bin is included. To include the left edge and exclude the right edge instead, we can do:
pd.cut(ages,bins,right=False)
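
To see how the `right` parameter of `pd.cut` decides which edge of a bin is closed, the sketch below bins three ages that sit exactly on bin edges. The `edge_ages` list is a hypothetical example, not part of the tutorial's data.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25, 35]  # hypothetical values lying exactly on bin edges

# Default right=True: bins look like (18, 25], so 18 itself falls outside every bin.
right_closed = pd.cut(edge_ages, bins)

# right=False: bins look like [18, 25), so 18 lands in the first bin.
left_closed = pd.cut(edge_ages, bins, right=False)

print(right_closed)
print(left_closed)
print(right_closed.codes, left_closed.codes)  # a code of -1 marks a value in no bin
```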

In [None]:
#pandas library intrinsically assigns an encoding to categorical variables.
cats.codes

#older pandas versions also exposed this as cats.labels (since removed)

In [None]:
#Let's check how many observations fall under each bin
pd.value_counts(cats)

In [None]:
#Also, we can pass a unique name to each label.

bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']
new_cats = pd.cut(ages, bins,labels=bin_names)

pd.value_counts(new_cats)

In [None]:
#we can also calculate their cumulative sum
pd.value_counts(new_cats).cumsum()

Let's proceed and learn about grouping data and creating pivots in pandas. It's an immensely important data analysis method which you'd probably have to use on every data set you work with.

In [None]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

In [None]:
#calculate the mean of data1 column by key1
grouped = df['data1'].groupby(df['key1'])
grouped.mean()


In [None]:
# Now, let's see how to slice the data frame.
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

In [None]:
#get first n rows from the data frame
df[:3]

In [None]:
#slice based on date range
df['20130101':'20130104']

In [None]:
#slicing based on column names
df.loc[:,['A','B']]

In [None]:
#slicing based on both row index labels and column names
df.loc['20130102':'20130103',['A','B']]

In [None]:
#slicing based on integer position of rows
df.iloc[3] 

#returns 4th row (position 3)

In [None]:
#returns a specific range of rows and columns
df.iloc[2:4, 0:2]

In [None]:
#returns specific rows and columns using lists containing columns or row indexes
df.iloc[[1,5],[0,2]] 

In [None]:
#Similarly, we can do Boolean indexing based on column values as well. This helps in filtering a data set based on a pre-defined condition.

df[df.A > 1]

In [None]:
#we can copy the data set
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
df2

In [None]:
#select rows based on column values
df2[df2['E'].isin(['two','four'])]

In [None]:
#select all rows except those with two and four
df2[~df2['E'].isin(['two','four'])]


In [None]:
#We can also use a query method to select rows based on a criterion. Let's see how!

#list all rows where A is greater than C
df.query('A > C')

In [None]:
#using OR condition
df.query('A < B | C > A')

Pivot tables are extremely useful in analyzing data using a customized tabular format. I think, among other things, Excel is popular because of the pivot table option. It offers a super-quick way to analyze data.

In [None]:
#create a data frame
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
#calculate means of each group
data.pivot_table(values='ounces',index='group',aggfunc='mean')


In [None]:
#calculate count by each group
data.pivot_table(values='ounces',index='group',aggfunc='count')

**Up till now, we've become familiar with the basics of pandas library using toy examples. Now, we'll take up a real-life data set and use our newly gained knowledge to explore it.**

# <a name="exploring"></a>Exploring an ML Dataset


We'll work with the popular adult data set. The data set has been taken from the **UCI Machine Learning Repository**. You can download the [data from here](https://s3-ap-southeast-1.amazonaws.com/he-public-data/datafiles19cdaf8.zip). In this data set, the dependent variable is "target." It is a binary classification problem. We need to predict whether the salary of a given person is <=50K or >50K.


In [None]:
'''Here, you have to upload the .csv data files from your system after downloading the archive from the above link.
Unzip the archive and extract the files to a location where they can be easily found'''



from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
!ls

In [None]:
#loading the data with pandas using read_csv
import pandas as pd
filename = 'train.csv'
train = pd.read_csv(filename)
filename_test = 'test.csv'
test = pd.read_csv(filename_test)

In [None]:
#train
#uncomment the above code to print the data in train.csv

#test
#uncomment the above code to print the data in test.csv


In [None]:
#check data set
train.info()

We see that the train data has 32561 rows and 15 columns. Out of these 15 columns, 6 have integer classes and the rest have object (or character) classes. Similarly, we can check the test data. An alternative way of quickly checking rows and columns is:

In [None]:
print ("The train data has",train.shape)
print ("The test data has",test.shape)


In [None]:
#Let's have a glimpse of the data set
train.head()

In [None]:
nans = train.shape[0] - train.dropna().shape[0]
print ("%d rows have missing values in the train data" %nans)

nand = test.shape[0] - test.dropna().shape[0]
print ("%d rows have missing values in the test data" %nand)

**We should be more curious to know which columns have missing values.**

In [None]:

#only 3 columns have missing values
train.isnull().sum()

In [None]:
#Let's count the number of unique values from character variables.

cat = train.select_dtypes(include=['O'])
cat.apply(pd.Series.nunique)

Since the missing values are all found in 3 character variables, let's impute them with their respective modes.

In [None]:
#Workclass
train.workclass.value_counts(sort=True)
train['workclass'] = train['workclass'].fillna('Private')


#Occupation
train.occupation.value_counts(sort=True)
train['occupation'] = train['occupation'].fillna('Prof-specialty')


#Native Country
train['native.country'].value_counts(sort=True)
train['native.country'] = train['native.country'].fillna('United-States')

In [None]:
# Let's check again if there are any missing values left.

train.isnull().sum()

Now, we'll check the target variable to investigate if this data is imbalanced or not.

In [None]:


#check proportion of target variable
train.target.value_counts()/train.shape[0]

We see that 75% of the data set belongs to <=50K class. This means that even if we take a rough guess of target prediction as <=50K, we'll get 75% accuracy. Isn't that amazing? Let's create a cross tab of the target variable with education. With this, we'll try to understand the influence of education on the target variable.
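
The baseline argument above can be sketched on toy data: always predicting the majority class gives an accuracy equal to that class's share. The four-row `target` series is a made-up stand-in for the real column, built to have the same 75/25 split.

```python
import pandas as pd

# Toy stand-in for train.target with a 75% / 25% class split.
target = pd.Series(['<=50K', '<=50K', '<=50K', '>50K'])

proportions = target.value_counts(normalize=True)  # share of each class
majority_class = proportions.idxmax()              # most frequent class

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = (target == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Any real model should be judged against this baseline rather than against zero.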

In [None]:
pd.crosstab(train.education, train.target,margins=True)/train.shape[0]

We see that out of the 75% of people with <=50K salary, 27% are high school graduates, which is plausible as people with lower levels of education are expected to earn less. On the other hand, out of the 25% of people with >50K salary, 6% are bachelors and 5% are high-school grads. (All of these percentages are shares of the whole data set.) Now, this pattern seems to be a matter of concern. That's why we'll have to consider more variables before coming to a conclusion.

If you've come this far, you might be curious to get a taste of building your first machine learning model. In the coming week we'll share an exclusive tutorial on machine learning in python. However, let's get a taste of it here.

We'll use the famous and formidable scikit-learn library. Scikit-learn accepts data in numeric format, so we'll have to convert the character variables into numeric ones. We'll use the LabelEncoder class.

In label encoding, each unique value of a variable gets assigned a number. For example, let's say a variable color has four values ['red','green','blue','pink'].

Label encoding this variable will return output as: blue = 0, green = 1, pink = 2, red = 3 (LabelEncoder numbers the values in sorted order).
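
LabelEncoder assigns codes according to the sorted order of the unique values. The sketch below reproduces that numbering with `np.unique`, used here as a lightweight stand-in for LabelEncoder rather than the encoder itself; the variable names are hypothetical.

```python
import numpy as np

colors = np.array(['red', 'green', 'blue', 'pink'])

# np.unique returns the sorted unique values and, for each original entry,
# the position of its value in that sorted list.
classes, codes = np.unique(colors, return_inverse=True)

mapping = dict(zip(colors.tolist(), codes.tolist()))
print(classes)  # sorted classes
print(mapping)  # code assigned to each colour
```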

In [None]:
#load sklearn and encode all object type variables
from sklearn import preprocessing

for x in train.columns:
    if train[x].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[x].values))
        train[x] = lbl.transform(list(train[x].values))

In [None]:
#Let's check the changes applied to the data set.

train.head()

In [None]:
#As we can see, all the variables have been converted to numeric, including the target variable.

#<=50K = 0 and >50K = 1
train.target.value_counts()

# <a name="building"></a>Building a Random Forest Model

<a name="buildingthemodel"></a>Let's create a random forest model and check the model's accuracy.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

y = train['target']
del train['target']

X = train
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

#train the RF classifier
clf = RandomForestClassifier(n_estimators = 500, max_depth = 6)
clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=6, max_features='auto', max_leaf_nodes=None,
                min_impurity_split=1e-07, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
                verbose=0, warm_start=False)

clf.predict(X_test)

**<a name="prediction"></a>Now, let's make predictions on the test set and check the model's accuracy.**


In [None]:
#make prediction and check model's accuracy
prediction = clf.predict(X_test)
acc =  accuracy_score(np.array(y_test),prediction)
print ('The accuracy of Random Forest is {}'.format(acc))

#if you get a numpy module missing error, then import numpy

Well, we can do tons of things on this data and improve the accuracy. We'll learn about it in future articles. What's next?

In this tutorial, we split the train data into two parts (70% for training, 30% held out) and made predictions on the held-out part. As an exercise, you should use this model to make predictions on the test data we loaded initially. You can apply the same steps we performed on the train data to prepare the test data. 

[Go to Top](#gotop)