# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answer. Think about cases where your code should run even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [1]:
# load these packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyspark
from pyspark.ml import Pipeline, pipeline
from pyspark.ml import feature, classification, regression, evaluation
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

We will analyze the Mid-Atlantic wage dataset (https://rdrr.io/cran/ISLR/man/Wage.html).

In [2]:
# read-only
drop_cols = ['_c0', 'logwage', 'sex', 'region']
wage_df = spark.read.csv('/datasets/ISLR/Wage.csv', header=True, inferSchema=True).drop(*drop_cols)
training_df, validation_df, testing_df = wage_df.randomSplit([0.6, 0.3, 0.1], seed=0)
wage_df.printSchema()

root
 |-- year: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- maritl: string (nullable = true)
 |-- race: string (nullable = true)
 |-- education: string (nullable = true)
 |-- jobclass: string (nullable = true)
 |-- health: string (nullable = true)
 |-- health_ins: string (nullable = true)
 |-- wage: double (nullable = true)



In [3]:
# explore the data
wage_df.limit(10).toPandas()

Unnamed: 0,year,age,maritl,race,education,jobclass,health,health_ins,wage
0,2006,18,1. Never Married,1. White,1. < HS Grad,1. Industrial,1. <=Good,2. No,75.043154
1,2004,24,1. Never Married,1. White,4. College Grad,2. Information,2. >=Very Good,2. No,70.47602
2,2003,45,2. Married,1. White,3. Some College,1. Industrial,1. <=Good,1. Yes,130.982177
3,2003,43,2. Married,3. Asian,4. College Grad,2. Information,2. >=Very Good,1. Yes,154.685293
4,2005,50,4. Divorced,1. White,2. HS Grad,2. Information,1. <=Good,1. Yes,75.043154
5,2008,54,2. Married,1. White,4. College Grad,2. Information,2. >=Very Good,1. Yes,127.115744
6,2009,44,2. Married,4. Other,3. Some College,1. Industrial,2. >=Very Good,1. Yes,169.528538
7,2008,30,1. Never Married,3. Asian,3. Some College,2. Information,1. <=Good,1. Yes,111.720849
8,2006,41,1. Never Married,2. Black,3. Some College,2. Information,2. >=Very Good,1. Yes,118.884359
9,2004,52,2. Married,1. White,2. HS Grad,2. Information,2. >=Very Good,1. Yes,128.680488


# Question 1: Codify the data using transformers (20 pts)

Create a pipeline fitted to the entire dataset `wage_df` and call it `pipe_feat`. This pipeline should codify the columns `maritl`, `race`, `education`, `jobclass`, `health`, and `health_ins`. The codification should be a combination of a `StringIndexer` and a `OneHotEncoder`. For example, for `maritl`, `StringIndexer` should create a column `maritl_index` and `OneHotEncoder` should create a column `maritl_feat`. Investigate the parameters of `StringIndexer` so that the labels are indexed alphabetically in ascending order: for example, the 1st index for `maritl_index` should correspond to `1. Never Married`, the 2nd index to `2. Married`, and so forth. Also, investigate the parameters of `OneHotEncoder` so that no column is dropped, as is usually done for dummy variables. That is, marital status should have one column for each of its classes.

The pipeline should create a column `features` that combines `year`, `age`, and all codified columns.
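
As a plain-Python illustration of the encoding being requested (not the PySpark solution itself), the sketch below indexes the `maritl` labels in ascending alphabetical order and one-hot encodes them without dropping a category. The helper names `alphabetical_index` and `one_hot` are hypothetical and exist only for this example. In PySpark, the parameters to investigate are `stringOrderType` on `StringIndexer` and `dropLast` on `OneHotEncoder`.

```python
# Plain-Python sketch of the requested encoding; helper names are hypothetical.

def alphabetical_index(values):
    """Map each distinct label to an index, in ascending alphabetical order."""
    return {label: i for i, label in enumerate(sorted(set(values)))}


def one_hot(index, size):
    """Return a full one-hot vector: one slot per class, none dropped."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# A few marital-status values in arbitrary order.
maritl_values = [
    "2. Married", "1. Never Married", "4. Divorced",
    "3. Widowed", "5. Separated", "2. Married",
]

maritl_to_index = alphabetical_index(maritl_values)
maritl_feat = [one_hot(maritl_to_index[v], len(maritl_to_index)) for v in maritl_values]

print(maritl_to_index["1. Never Married"])  # 0
print(maritl_feat[0])                       # [0.0, 1.0, 0.0, 0.0, 0.0]
```

With one slot per class, the six categorical columns contribute 5 + 4 + 5 + 2 + 2 + 2 = 20 entries, which together with `year` and `age` gives a `features` vector of length 22.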

In [4]:
# create `pipe_feat` below
# YOUR CODE HERE

#Taking the names of the columns from the original dataframe 'wage_df' and removing the numeric columns (year and age)
#and the dependent variable (wage), the one to be predicted:
cols=list(set(wage_df.columns)-set(["year","age","wage"]))

#Making lists for StringIndexers and OneHotEncoders for every categorical column in the dataframe:
string_indexer=[]
one_hot=[]

#Making lists for the columns names associated with every StringIndexer and OneHotEncoder
#Eg: Columns 'maritl' would have output column for StringIndexer as 'maritl_index' and OneHotEncoder output as 'maritl_feat':
feats_index=[]
feats_feat=[]

#Running a loop over all the columns, making a StringIndexer and a OneHotEncoder for each column, and also recording the
#output column names for the dataframe built later:
for col in cols:
    string_indexer.append(feature.StringIndexer(inputCol=col,outputCol=col+"_index"))
    one_hot.append(feature.OneHotEncoder(inputCol=col+"_index",outputCol=col+"_feat"))
    feats_index.append(col+"_index")
    feats_feat.append(col+"_feat")

#Finally, I added the columns year and age to the final columns list, as well as the columns from StringIndexer and OneHotEncoder:
feats=["year","age"]+feats_index+feats_feat
print(feats)

#Since 'stages' in Pipeline takes argument as a list, making one list out of different lists I made earlier (one for StringIndexers
#and one for OneHotEncoders for different columns). Also adding a final VectorAssembler at the end:
pipe_stages=string_indexer+one_hot+[feature.VectorAssembler(inputCols=feats,outputCol="features")]

#Making the Pipeline and fitting it:
pipe_feat=Pipeline(stages=pipe_stages).fit(wage_df)

#Transforming data using the model from the pipeline fitting:
trans_df=pipe_feat.transform(wage_df)

#Taking a look at the dataframe thus made:
trans_df.toPandas().head()

#raise NotImplementedError()

['year', 'age', 'health_index', 'education_index', 'health_ins_index', 'race_index', 'jobclass_index', 'maritl_index', 'health_feat', 'education_feat', 'health_ins_feat', 'race_feat', 'jobclass_feat', 'maritl_feat']


Unnamed: 0,year,age,maritl,race,education,jobclass,health,health_ins,wage,health_index,...,race_index,jobclass_index,maritl_index,health_feat,education_feat,health_ins_feat,race_feat,jobclass_feat,maritl_feat,features
0,2006,18,1. Never Married,1. White,1. < HS Grad,1. Industrial,1. <=Good,2. No,75.043154,1.0,...,0.0,0.0,1.0,(0.0),"(0.0, 0.0, 0.0, 0.0)",(0.0),"(1.0, 0.0, 0.0)",(1.0),"(0.0, 1.0, 0.0, 0.0)","(2006.0, 18.0, 1.0, 4.0, 1.0, 0.0, 0.0, 1.0, 0..."
1,2004,24,1. Never Married,1. White,4. College Grad,2. Information,2. >=Very Good,2. No,70.47602,0.0,...,0.0,1.0,1.0,(1.0),"(0.0, 1.0, 0.0, 0.0)",(0.0),"(1.0, 0.0, 0.0)",(0.0),"(0.0, 1.0, 0.0, 0.0)","(2004.0, 24.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1..."
2,2003,45,2. Married,1. White,3. Some College,1. Industrial,1. <=Good,1. Yes,130.982177,1.0,...,0.0,0.0,0.0,(0.0),"(0.0, 0.0, 1.0, 0.0)",(1.0),"(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(2003.0, 45.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0..."
3,2003,43,2. Married,3. Asian,4. College Grad,2. Information,2. >=Very Good,1. Yes,154.685293,0.0,...,2.0,1.0,0.0,(1.0),"(0.0, 1.0, 0.0, 0.0)",(1.0),"(0.0, 0.0, 1.0)",(0.0),"(1.0, 0.0, 0.0, 0.0)","(2003.0, 43.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 1..."
4,2005,50,4. Divorced,1. White,2. HS Grad,2. Information,1. <=Good,1. Yes,75.043154,1.0,...,0.0,1.0,2.0,(0.0),"(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0)",(0.0),"(0.0, 0.0, 1.0, 0.0)","(2005.0, 50.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0..."


In [5]:
# investigate the results
pipe_feat.transform(wage_df).limit(5).toPandas().T

Unnamed: 0,0,1,2,3,4
year,2006,2004,2003,2003,2005
age,18,24,45,43,50
maritl,1. Never Married,1. Never Married,2. Married,2. Married,4. Divorced
race,1. White,1. White,1. White,3. Asian,1. White
education,1. < HS Grad,4. College Grad,3. Some College,4. College Grad,2. HS Grad
jobclass,1. Industrial,2. Information,1. Industrial,2. Information,2. Information
health,1. <=Good,2. >=Very Good,1. <=Good,2. >=Very Good,1. <=Good
health_ins,2. No,2. No,1. Yes,1. Yes,1. Yes
wage,75.0432,70.476,130.982,154.685,75.0432
health_index,1,0,1,0,1


In [6]:
# (20 pts)
assert set(type(pm) for pm in pipe_feat.stages) == {feature.OneHotEncoder, feature.StringIndexerModel, feature.VectorAssembler}
assert len(pipe_feat.transform(wage_df).first().features) == 22


# Question 2: (15 pts)

Create three pipelines that contain three different random forest regressions that take in all features from the `wage_df` to predict `wage`. These pipelines should have as first stage the pipeline created in question 1 and should be fitted to the training data.

- `pipe_rf1`: Random forest with `maxDepth=1` and `numTrees=60`
- `pipe_rf2`: Random forest with `maxDepth=3` and `numTrees=40`
- `pipe_rf3`: Random forest with `maxDepth=6`, `numTrees=20`

In [7]:
# create the fitted pipelines `pipe_rf1`, `pipe_rf2`, and `pipe_rf3` here
# YOUR CODE HERE
from pyspark.ml.regression import RandomForestRegressor

#Using the pipe_feat made previously, I made 3 pipelines that have it as their first stage and feed its output ('features') to
#a random forest regressor. Each regressor uses the parameters the professor asked for. The pipelines
#are fit on training_df, which is 60% of the original data.

#The labelCol is the dependent variable column, in my case the 'wage' column:

random_forest_1=RandomForestRegressor(featuresCol="features",labelCol="wage",maxDepth=1,numTrees=60)
pipe_rf1=Pipeline(stages=[pipe_feat,random_forest_1]).fit(training_df)

random_forest_2=RandomForestRegressor(featuresCol="features",labelCol="wage",maxDepth=3,numTrees=40)
pipe_rf2=Pipeline(stages=[pipe_feat,random_forest_2]).fit(training_df)

random_forest_3=RandomForestRegressor(featuresCol="features",labelCol="wage",maxDepth=6,numTrees=20)
pipe_rf3=Pipeline(stages=[pipe_feat,random_forest_3]).fit(training_df)

#raise NotImplementedError()

In [8]:
# tests for 15 pts
np.testing.assert_equal(type(pipe_rf1.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(pipe_rf2.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(pipe_rf3.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(pipe_rf1.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf2.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf3.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf1.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf2.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf3.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 3 (10 pts)

Use the following evaluator to compute the RMSE of the models on validation data. Print the RMSE of the three models and assign the best one (i.e., the best pipeline) to a variable `best_model`
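
For reference, the quantity behind `metricName='rmse'` is the square root of the mean squared difference between labels and predictions. The sketch below computes it in plain Python and then picks the entry with the lowest score. The function `rmse`, the toy wages, and the scores in `validation_rmse` are hypothetical values chosen for illustration, not results from this notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values, purely illustrative (not taken from the wage data).
toy_wages = [100.0, 120.0, 80.0]
toy_predictions = [110.0, 110.0, 90.0]
print(rmse(toy_wages, toy_predictions))  # 10.0

# Hypothetical validation scores keyed by pipeline name; lower RMSE is better.
validation_rmse = {"pipe_rf1": 36.2, "pipe_rf2": 33.6, "pipe_rf3": 33.2}
best_name = min(validation_rmse, key=validation_rmse.get)
print(best_name)  # pipe_rf3
```

Keeping the scores in a dictionary and taking `min` with a `key` avoids a chain of `if`/`elif` comparisons when more models are added.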

In [9]:
evaluator = evaluation.RegressionEvaluator(labelCol='wage', metricName='rmse')
# use it as follows:
#   evaluator.evaluate(fitted_pipeline.transform(df)) -> RMSE

In [10]:
# print RMSE of each model and define `best_model`
# YOUR CODE HERE

#Evaluating RMSE for the 3 models made on the validation_df (30% of the original data, exclusive of the training_df):

RMSE1=evaluator.evaluate(pipe_rf1.transform(validation_df))
RMSE2=evaluator.evaluate(pipe_rf2.transform(validation_df))
RMSE3=evaluator.evaluate(pipe_rf3.transform(validation_df))

print("Root Mean Square Error for pipe_rf1: "+str(RMSE1))
print("Root Mean Square Error for pipe_rf2: "+str(RMSE2))
print("Root Mean Square Error for pipe_rf3: "+str(RMSE3))

#The model with the lowest RMSE is assigned to 'best_model':
if RMSE1==min(RMSE1,RMSE2,RMSE3):
    best_model=pipe_rf1
elif RMSE2==min(RMSE1,RMSE2,RMSE3):
    best_model=pipe_rf2
else:
    best_model=pipe_rf3

#raise NotImplementedError()

Root Mean Square Error for pipe_rf1: 36.214626318832885
Root Mean Square Error for pipe_rf2: 33.56759433852202
Root Mean Square Error for pipe_rf3: 33.241366680095744


In [11]:
# tests for 10 pts
np.testing.assert_equal(type(best_model.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(best_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(best_model.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 4: 5 pts

Compute the RMSE of the model on testing data, print it, and assign it to variable `RMSE_best`

In [12]:
# create RMSE_best below
# YOUR CODE HERE

#Taking the RMSE of the 'best_model' and assigning it to 'RMSE_best':
RMSE_best=evaluator.evaluate(best_model.transform(testing_df))

#raise NotImplementedError()

In [13]:
# tests for 5 pts
np.testing.assert_array_less(RMSE_best, 40)
np.testing.assert_array_less(30, RMSE_best)

# Question 5: 5 pts

Using the parameters of the best model, create a new pipeline called `final_model` and fit it to the entire data (`wage_df`)

In [14]:
# create final_model pipeline below
# YOUR CODE HERE

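#Taking the two stages of 'best_model' (the feature pipeline and the random forest), putting them in a new pipeline and
#calling fit on the whole 'wage_df'. Both stages are already fitted models, so Pipeline.fit reuses them as transformers and
#the random forest keeps the trees it learned on 'training_df'. The result is assigned to 'final_model':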

final_model=Pipeline(stages=[best_model.stages[0],best_model.stages[1]]).fit(wage_df)

#raise NotImplementedError()

In [15]:
# tests for 5 pts
np.testing.assert_equal(type(final_model.stages[0]), pipeline.PipelineModel)
np.testing.assert_equal(type(final_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(final_model.transform(wage_df)), pyspark.sql.dataframe.DataFrame)

# Question 6: 30 pts

Create a pandas dataframe `feature_importance` with the columns `feature` and `importance`, where `feature` contains the names of the features. Give appropriate feature names such as `maritl_1._Never_Married`. You can build these feature names by using the labels from the fitted `StringIndexer` used in Question 1. Use the feature importances as determined by the random forest of the final model (`final_model`). Sort the pandas dataframe by `importance` in descending order and display it.
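
The naming scheme described above can be sketched in plain Python. The function `build_feature_names` is hypothetical, and the hard-coded label lists stand in for the `labels` attribute of the fitted `StringIndexer` models. The sketch assumes that the `features` vector holds the numeric columns first, followed by one one-hot slot per label in label order; the names only line up with the importances if the vector was assembled that way.

```python
def build_feature_names(numeric_cols, categorical_labels):
    """Build one name per feature slot: numeric columns first, then column_label pairs."""
    names = list(numeric_cols)
    for column, labels in categorical_labels:
        for label in labels:
            # Join column and label, then replace spaces with underscores.
            names.append((column + " " + label).replace(" ", "_"))
    return names


# Stand-in for (input column, fitted StringIndexer labels) pairs.
categorical_labels = [
    ("maritl", ["1. Never Married", "2. Married", "3. Widowed", "4. Divorced", "5. Separated"]),
    ("jobclass", ["1. Industrial", "2. Information"]),
]

feature_names = build_feature_names(["year", "age"], categorical_labels)
print(feature_names[2])    # maritl_1._Never_Married
print(len(feature_names))  # 9
```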

In [16]:
# create feature_importance below
# YOUR CODE HERE

#So, I start with taking the first stage from the 'final_model' which is the 'pipe_feat' I made earlier:
inner_pipeline=final_model.stages[0]

#Taking names of all the columns from 'wage_df' and removing the numerical columns (year and age) and the dependent variable wage:
cols=list(set(wage_df.columns)-set(["wage","year","age"]))

#I made a empty list for all the column names that'd be used for the dataframe 'feature_importance' eventually:
col_list=[]

#From the 'pipe_feat' I extracted from 'final_model', I am reading all the stages (StringIndexer, OneHotEncoder or VectorAssembler).
#I am working only with the StringIndexers and taking their labels, which name the columns used in
#the random forest model. This makes a list of lists. Each inner list has the labels of all the classes
#that particular categorical variable can take:
for ele in inner_pipeline.stages:
    if isinstance(ele,feature.StringIndexerModel):
        col_list.append(ele.labels)

#Just checking the lengths, they are the same, they both have a length of 6:
#print(len(col_list))
#print(len(cols))

#'cols' and 'col_list' have a one-to-one relation, because the StringIndexers were created in the order of 'cols':

#Starting the list of feature names with the numeric columns. The other names come from the StringIndexers,
#which did not include the numeric columns, so I had to add them explicitly:
col1=["year","age"]

#Going through the lists inside the bigger list (col_list). For every item in an inner list (the categorical values
#that particular column can take), I am making a string that starts with the column name, followed by the label:
for col,labels in zip(cols,col_list):
    for item in labels:
        col1.append((str(col)+" "+str(item)).replace(" ","_"))

#It is basically such that at the same index for both the lists, the 'cols' list has a column name and the list at the same
#index in 'col_list' is a list of all possible categorical values that particular column can take. Finally, 'col1' will have
#all feature names used in random forest the way it is asked.
   
#Then I am assigning the random forest model from the 'final_model' to 'rf_stage':
rf_stage=final_model.stages[-1]

#Taking the importance weights:
weights=list(rf_stage.featureImportances.toArray())

#Checking the length to make sure the columns list (col1; all the factors used in random forest) and the importance values
#list is of the same length. They both have a length of 22:
print(len(col1))
print(len(weights))

#Making a dataframe from those 2 lists, naming the columns as 'feature' and 'importance' and sorting the dataframe by 'importance'
#values in a descending manner:
feature_importance=pd.DataFrame(list(zip(col1,weights)),columns=["feature","importance"]).sort_values("importance",ascending=False)

#raise NotImplementedError()

22
22


In [17]:
# display your feature importances here
feature_importance

Unnamed: 0,feature,importance
3,health_1._<=Good,0.243577
12,race_2._Black,0.17057
1,age,0.110303
4,education_2._HS_Grad,0.071034
13,race_3._Asian,0.065522
10,health_ins_2._No,0.05848
18,maritl_1._Never_Married,0.044088
19,maritl_4._Divorced,0.043712
7,education_5._Advanced_Degree,0.040551
0,year,0.032153


In [18]:
# tests for 25 pts
assert type(feature_importance) == pd.core.frame.DataFrame
np.testing.assert_array_equal(list(feature_importance.columns), ['feature', 'importance'])

**(5 pts)** Comment below on the importance that the random forest has given to each feature. Are the values reasonable? Do they tell you anything valuable about the wage dataset? Answer in the cell below.

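The importance values obtained from the random forest model indicate how much each feature contributes to the outcome (in this case, the prediction of 'wage'). A random forest has no coefficients; the importance of a feature is based on how much the splits on that feature reduce the impurity of the target (for regression, its variance), weighted by the number of samples that reach each split. The larger the total reduction, the greater the importance.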

Note, however, that a random forest aggregates the importance of each feature over all the trees in the forest. That is, the importance of, say, 'age' combines the importance of 'age' in every tree of the model. The resulting values are normalized so that they sum to 1, so each one lies between 0 and 1.

From the feature_importance dataframe, it can be said that 'health_1.\_<=Good' is the most important feature for the random forest model as a whole, with a weight of 0.243577. These importance weights tell us how strongly each feature affects the prediction of 'wage'; however, they do not tell us whether a feature pushes the predicted 'wage' up or down.
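
A minimal sketch of how per-tree importances could be combined into forest-level importances, assuming the aggregation works as follows: each tree's gains are normalized to sum to 1, the normalized values are summed across trees, and the sum is normalized again. The function names and the gain values are hypothetical; in a real forest the gains come from the variance reduction at each split.

```python
def normalize(values):
    """Scale values so they sum to 1 (left unchanged if the total is 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def forest_importances(per_tree_gains):
    """Combine per-tree gains: normalize each tree, sum across trees, normalize again."""
    n_features = len(per_tree_gains[0])
    summed = [0.0] * n_features
    for gains in per_tree_gains:
        for j, value in enumerate(normalize(gains)):
            summed[j] += value
    return normalize(summed)


# Hypothetical total gain per feature (3 features) in each of 2 trees.
toy_gains = [
    [6.0, 3.0, 1.0],
    [2.0, 2.0, 0.0],
]

importances = forest_importances(toy_gains)
print(importances)  # approximately [0.55, 0.40, 0.05]
```

Because the result sums to 1, each value reads as a share of the total importance, not as an effect size or a direction.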

P.S.: I have noticed that restarting the kernel and re-running all the code gives different results, most likely because no random seed is fixed: the importance weights, the features that receive those weights, and the trees all change.
This interpretation and answer are based on the output of this particular execution.

# Question 7:  15 pts.

Pick any of the trees from the final model and assign its `toDebugString` property to a variable `example_tree`. Print this variable and add comments to the cell describing how you think this particular tree is fitting the data

In [19]:
# create a variable example_tree with the toDebugString property of a tree from final_model.
# print this string and comment in this same cell about the branches that this tree fit
# YOUR CODE HERE

#I like the number 3, so I am assigning the tree with index 3 from my random forest model (the 4th tree) to 'example_tree'.
#I am using the rf_stage I made in the previous block of code:
print(len(rf_stage.trees))
example_tree=rf_stage.trees[3].toDebugString

#raise NotImplementedError()

20


In [20]:
#I wrote an extra block of code. The random forest regressor model does not show the actual name of the feature used, but rather
#shows it as 'feature {some number}'. So I wrote some code to map the feature numbers from the model to the actual feature names:

#The row labels of 'feature_importance' are the numbers that follow 'feature' in the random forest model,
#so I am taking the row labels (index) in a list:
num=list(feature_importance.index)

#Taking the names of the actual features from the 'feature_importance' dataframe in another list. The row labels and feature names
#correspond index by index in these lists:
feat=list(feature_importance["feature"])

#Finally, making a dictionary to map these feature numbers to the actual features:
feat_dir=dict(zip(num,feat))

for key in sorted(feat_dir):
    print("feature "+str(key)+" -> "+str(feat_dir[key]))

feature 0 -> year
feature 1 -> age
feature 2 -> health_2._>=Very_Good
feature 3 -> health_1._<=Good
feature 4 -> education_2._HS_Grad
feature 5 -> education_4._College_Grad
feature 6 -> education_3._Some_College
feature 7 -> education_5._Advanced_Degree
feature 8 -> education_1._<_HS_Grad
feature 9 -> health_ins_1._Yes
feature 10 -> health_ins_2._No
feature 11 -> race_1._White
feature 12 -> race_2._Black
feature 13 -> race_3._Asian
feature 14 -> race_4._Other
feature 15 -> jobclass_1._Industrial
feature 16 -> jobclass_2._Information
feature 17 -> maritl_2._Married
feature 18 -> maritl_1._Never_Married
feature 19 -> maritl_4._Divorced
feature 20 -> maritl_5._Separated
feature 21 -> maritl_3._Widowed
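
A mapping like the one printed above can also be applied directly to a tree's debug string, so that the branches read in terms of column names. The sketch below is a hypothetical helper, `label_tree`, that uses only the standard library. It is shown with the two numeric features, `year` and `age`, and leaves any index without a known name untouched.

```python
import re


def label_tree(debug_string, feature_names):
    """Replace 'feature N' with the N-th name; leave unknown indices untouched."""
    def replace(match):
        index = int(match.group(1))
        if index < len(feature_names):
            return feature_names[index]
        return match.group(0)

    # \b keeps 'feature 1' from matching the prefix of 'feature 12'.
    return re.sub(r"\bfeature (\d+)\b", replace, debug_string)


snippet = """Else (feature 0 > 2003.5)
 If (feature 1 <= 28.5)
  Predict: 74.80824208516948
 Else (feature 1 > 28.5)
  Predict: 91.85999080433996"""

labeled = label_tree(snippet, ["year", "age"])
print(labeled)
```

The first printed line becomes `Else (year > 2003.5)`, and the two inner branches split on `age`.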


In [21]:
# display the tree here
print(example_tree)

DecisionTreeRegressionModel (uid=dtr_37ae944ded4a) of depth 6 with 109 nodes
  If (feature 12 in {0.0})
   If (feature 4 in {1.0})
    If (feature 19 in {1.0})
     If (feature 9 in {0.0})
      If (feature 1 <= 22.5)
       If (feature 10 in {0.0})
        Predict: 54.63495928454017
       Else (feature 10 not in {0.0})
        Predict: 67.7371818694598
      Else (feature 1 > 22.5)
       If (feature 1 <= 32.5)
        Predict: 70.12158800435735
       Else (feature 1 > 32.5)
        Predict: 82.57053620624168
     Else (feature 9 not in {0.0})
      If (feature 0 <= 2003.5)
       If (feature 14 in {1.0})
        Predict: 61.40764018289706
       Else (feature 14 not in {1.0})
        Predict: 99.6894636984864
      Else (feature 0 > 2003.5)
       If (feature 1 <= 28.5)
        Predict: 74.80824208516948
       Else (feature 1 > 28.5)
        Predict: 91.85999080433996
    Else (feature 19 not in {1.0})
     If (feature 3 in {0.0,4.0})
      If (feature 1 <= 39.5)
       If (featur

In [22]:
# tests for 10 points
assert type(example_tree) == str
assert 'DecisionTreeRegressionModel' in example_tree
assert 'feature 0' in example_tree
assert 'If' in example_tree
assert 'Else' in example_tree
assert 'Predict' in example_tree

**(5 pts)** Comment on the feature that is at the top of the tree. Does it make sense that this feature is at the top?

The printout of 'example_tree', the tree with index 3 (the 4th tree in the random forest model), shows that 'feature 12' is at the top of the tree. The dictionary I made earlier, which maps feature numbers to the actual feature names from the dataframe, tells me that feature 12 is race_2.\_Black, the dummy variable for race being Black (1 if it is, 0 if it is not).

When I look back at the feature_importance dataframe, this feature is ranked 2nd in the random forest model, with an importance value of 0.170570. That value aggregates the importance of the feature over all the trees in the model. Being at the root of this tree means that, among the features this tree considered for its first split, this one gave the largest reduction in the variance of 'wage'. Since its overall importance is also high, it makes sense for it to be at the top of the tree.

P.S.: I have noticed that restarting the kernel and re-running all the code gives different results, most likely because no random seed is fixed: the importance weights, the features that receive those weights, and the trees all change.
This interpretation and answer are based on the output of this particular execution.