<a href="https://colab.research.google.com/github/wellia/Machine_Learning/blob/main/fifaSocker_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Overview

Recently, [Kaggle](https://www.kaggle.com) (a data science community and competition platform) released the data set '[FIFA19](https://www.kaggle.com/karangadiya/fifa19)', which consists of 18K+ FIFA 19 players with around 90 attributes extracted from the FIFA database. In this assessment task, we make it available as the following data set:
- The data set is [2020T2Data.csv](https://github.com/tulip-lab/sit742/raw/master/Assessment/2020/data/2020T2Data.csv)

In this task **use Spark packages**

- **Part 1**: Exploratory Data Analysis

- **Part 2**: Clustering Analysis, and identify the position profiles of each cluster

- **Part 3**: Classification Analysis, and evaluate the performance of different algorithms using cross-validation





## Part 1 - What we could know about FIFA 2019 Players? 

### 1.0. Libraries and data files
<a id="Load data"></a>
***

Import the necessary Spark environment, and load the data set [2020T2Data.csv](https://github.com/tulip-lab/sit742/raw/master/Assessment/2020/data/2020T2Data.csv).


In [None]:
!pip install wget
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
import os,wget
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Assessment/2020/data/2020T2Data.csv'
DataSet = wget.download(link_to_data)

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=8ff62aa63637b755c010c4f00bba93362bd3894a51b1716d325d5ce98c4a779d
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
import findspark
import numpy as np
findspark.init()
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt


### 1.1 Data Exploration

*Reminder: Use **PySpark** to complete the following data processing and model building. Otherwise, you will lose all marks.*

<a id="loading"></a>
***

<div class="alert alert-block alert-info">

**Code**: 
    import the csv file as a Spark dataframe and name it as df

</div>



In [None]:
# Import the '2020T2Data.csv' as a Spark dataframe and name it as df
spark = SparkSession.builder.appName('SIT742T2').getOrCreate()

df = spark.read.csv("2020T2Data.csv", header = True)
df.show(5)
df.printSchema()

+------+-----------------+---+--------------------+-----------+--------------------+-------+---------+-------------------+--------------------+--------+-------+-------+--------------+------------------------+---------+-----------+--------------+----------+---------+--------+-------------+------------+-----------+--------------------+------+----------+------+----------+--------+---------+---------------+------------+-------+---------+-----+----------+-----------+-----------+------------+-----------+-------+---------+-------+---------+-------+-------+--------+---------+----------+-------------+-----------+------+---------+---------+-------+--------------+-------------+--------+----------+---------+-------------+----------+-----------------+
|    ID|             Name|Age|               Photo|Nationality|                Flag|Overall|Potential|               Club|           Club Logo|value(M)|wage(K)|Special|Preferred Foot|International Reputation|Weak Foot|Skill Moves|     Work Rate| Body 

****


<div class="alert alert-block alert-info">

**Code**: 
    Check the statistics (min, mean and max) for the features Age and Overall. Then find the Avg Overall by Position and the Avg Overall by Nationality (sorted by Avg Overall).

<div class="alert alert-block alert-warning">
    
**Report**: 
    **1.1.A** Please answer the following questions under the section title '1.1.A':
    <ol>
        <li> What are the min, mean and max for Age? </li>
        <li> What are the min, mean and max for Overall? </li>
        <li> Which position do the most talented players (based on Avg Overall) play? </li>
        <li> Which are the top 3 countries most likely to have the most gifted players (based on sorting by Avg Overall)? </li>
    </ol>
</div>
</div>

In [None]:
from pyspark.sql import functions as F

#Statistics on Age
df.select([F.min("Age"), F.mean("Age"),  F.max("Age")]).show()


+--------+------------------+--------+
|min(Age)|          avg(Age)|max(Age)|
+--------+------------------+--------+
|      16|25.122205745043114|      45|
+--------+------------------+--------+



In [None]:
#Statistics on Overall
df.select([F.min("Overall"), F.mean("Overall"),  F.max("Overall")]).show()

+------------+-----------------+------------+
|min(Overall)|     avg(Overall)|max(Overall)|
+------------+-----------------+------------+
|          46|66.23869940132916|          94|
+------------+-----------------+------------+
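One caveat about the statistics above: `spark.read.csv` without `inferSchema=True` reads every column as a string, and the minimum or maximum of a string column is taken in lexicographic order. The results appear to be right here because the reported Age and Overall values all have two digits. A minimal plain-Python sketch of the pitfall, using made-up values rather than the data set:

```python
# Hypothetical values, not taken from the data set.
ages_as_text = ["16", "9", "45", "101"]

# Strings compare character by character, so "101" sorts before "16"
# and "9" sorts after "45".
text_min = min(ages_as_text)
text_max = max(ages_as_text)

# Casting to a number first gives the numeric answer.
ages = [int(a) for a in ages_as_text]
num_min = min(ages)
num_max = max(ages)

print("as text:", text_min, text_max)
print("as numbers:", num_min, num_max)
```

In PySpark, the equivalent safeguard is to cast the column first, for example `F.col("Age").cast("int")`, or to pass `inferSchema=True` when reading the file.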



In [None]:
# Show the position with the highest average Overall
df.groupby(['Position']).agg({"Overall": "AVG"}).sort("avg(Overall)", ascending=False).show(1)


In [None]:
#Top 3 countries most likely having good players
df.groupby(['Nationality']).agg({"Overall": "AVG"}).sort("avg(Overall)", ascending=False).show(3)
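When ranking groups by an average, it can help to look at the group size as well, because a country with a single highly rated player can top the list. The following is a minimal plain-Python sketch of "group by, average, sort" on made-up pairs; the names `players`, `totals` and `ranking` are hypothetical and the values are not from the data set.

```python
from collections import defaultdict

# Hypothetical (nationality, overall) pairs, for illustration only.
players = [("A", 80), ("A", 60), ("A", 70), ("B", 85), ("C", 72), ("C", 74)]

totals = defaultdict(lambda: [0, 0])  # nationality -> [sum of overall, player count]
for nationality, overall in players:
    totals[nationality][0] += overall
    totals[nationality][1] += 1

# Average per group, sorted from highest to lowest, keeping the group size.
ranking = sorted(
    ((nat, s / n, n) for nat, (s, n) in totals.items()),
    key=lambda row: row[1],
    reverse=True,
)
for nat, avg, n in ranking:
    print(nat, round(avg, 1), n)
```

Here group "B" ranks first with only one player. In PySpark, a count can be added next to the average with `agg(F.avg("Overall"), F.count("*"))`.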

In [None]:
# Average Potential per country, pivoted by position and ordered by country
df_avg_potential = df.groupBy("Nationality").pivot("Position").agg({"Potential": "AVG"}).sort("Nationality")
print('Avg potential on country by position')
df_avg_potential.show(10)


# Find the position with the highest average Potential among Australian players.
# show() returns None, so keep the DataFrame and display it in a separate step.
Aus_avgpot = df.filter("Nationality == 'Australia'").groupby("Position").agg({'Potential':'AVG'}).sort('avg(Potential)', ascending=False)
Aus_avgpot.show(1)


In [None]:
# Code for plot
df1 = df.groupBy("Age").agg({"Potential": "AVG"}).withColumnRenamed("avg(Potential)", "Average").sort("Age")
df2 = df.groupBy("Age").agg({"Overall": "AVG"}).withColumnRenamed("avg(Overall)", "Average").sort("Age")

x1 = [i[0] for i in df1.select('Age').collect()]
y1 = [i[0] for i in df1.select('Average').collect()]

x2 = [i[0] for i in df2.select('Age').collect()]
y2 = [i[0] for i in df2.select('Average').collect()]

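plt.figure(figsize=(15, 5))
plt.plot(x1, y1, label='Average Potential')
plt.plot(x2, y2, label='Average Overall')
plt.xlabel('Age')
plt.ylabel('Average rating')
plt.legend()
plt.show()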

## Part 2 - Unsupervised Learning: K-Means

<a id="kmeans"></a>
***


### 2.1 Data Preparation

*Use **pyspark** to complete the following data processing and model building.*


****

<div class="alert alert-block alert-info">

**Code**: 
    You will need to remove the Goal Keepers (Position = 'GK') and only use the skillset attributes (Height(CM),
Weight(KG),
Crossing,
Finishing,
HeadingAccuracy,
ShortPassing,
Volleys,
Dribbling,
Curve,
FKAccuracy,
LongPassing,
BallControl,
Acceleration,
SprintSpeed,
Agility,
Reactions,
Balance,
ShotPower,
Jumping,
Stamina,
Strength,
LongShots,
Aggression,
Interceptions,
Positioning,
Vision,
Penalties,
Composure,
Marking,
StandingTackle,
SlidingTackle) 

</div>



In [None]:
# Select the relevant features and filter out the goalkeepers (GK)
df = df.select('ID', 'Position', 'Height(CM)', 'Weight(KG)', 'Crossing', 'Finishing', 'HeadingAccuracy',  
           'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 
           'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance', 
           'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 
           'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle').filter("Position != 'GK'")
df.show()

To make the later stage easier, we define the position group by using the position feature.
- DEF = [LB,LWB,RB,LCB,RCB,CB,RWB] ,
- FWD = [RF,LF,LW,RS,RW,LS,CF,ST] ,
- MID = [LCM,LM,RDM,CAM,RAM,RCM,CM,CDM,RM,LAM,LDM]

****

<div class="alert alert-block alert-info">

**Code**: 
    Create a new column called Position_Group, containing only DEF/FWD/MID, in the dataframe you created previously

</div>


In [None]:
from pyspark.sql.functions import when,col,lit

# Define position group by position features
DEF = ['LB','LWB','RB','LCB','RCB','CB','RWB']
FWD = ['RF','LF','LW','RS','RW','LS','CF','ST']
MID = ['LCM','LM','RDM','CAM','RAM','RCM','CM','CDM','RM','LAM','LDM']

# create new column Position_Group with position group DEF/FWD/MID
df_kmeans_new = df.withColumn("Position_Group", when(col("Position").isin(DEF), lit("DEF"))
  .when(col("Position").isin(FWD), lit("FWD"))
  .otherwise(lit("MID")))
df_kmeans_new.printSchema()
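Note that the `otherwise` branch above assigns MID to every position that is not in DEF or FWD, including missing or unexpected values. As a minimal plain-Python sketch of the same mapping with an explicit fallback (the names `GROUP_OF` and `position_group` are hypothetical helpers, not part of the assignment):

```python
# Position lists as defined in the text above.
DEF = ['LB', 'LWB', 'RB', 'LCB', 'RCB', 'CB', 'RWB']
FWD = ['RF', 'LF', 'LW', 'RS', 'RW', 'LS', 'CF', 'ST']
MID = ['LCM', 'LM', 'RDM', 'CAM', 'RAM', 'RCM', 'CM', 'CDM', 'RM', 'LAM', 'LDM']

# Build one lookup table: position -> group.
GROUP_OF = {}
for group, positions in (("DEF", DEF), ("FWD", FWD), ("MID", MID)):
    for position in positions:
        GROUP_OF[position] = group

def position_group(position):
    """Return DEF/FWD/MID, or None for a position outside the three lists."""
    return GROUP_OF.get(position)

print(position_group("CB"), position_group("ST"), position_group("CAM"), position_group("GK"))
```

Returning `None` for unknown positions makes it easy to spot rows that do not belong to any of the three groups before they reach the model.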


Now we drop Position_Group and Position and assemble the remaining skill attributes into the feature vector for K-Means.




In [None]:
from pyspark.ml.feature import VectorAssembler

# remove Position_Group and Position to create the feature for Kmeans
FEATURES_COL = ['Height(CM)', 'Weight(KG)', 
                      'Crossing', 'Finishing', 'HeadingAccuracy', 
                      'ShortPassing', 'Volleys', 'Dribbling', 'Curve',
                      'FKAccuracy', 'LongPassing', 'BallControl', 
                      'Acceleration', 'SprintSpeed', 'Agility', 
                      'Reactions', 'Balance', 'ShotPower', 'Jumping', 
                      'Stamina', 'Strength', 'LongShots', 'Aggression', 
                      'Interceptions', 'Positioning', 'Vision', 'Penalties', 
                      'Composure', 'Marking', 'StandingTackle', 'SlidingTackle']

for col_name in FEATURES_COL:
    df_kmeans_new = df_kmeans_new.withColumn(col_name, col(col_name).cast('float'))

vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans_ = vecAssembler.transform(df_kmeans_new).select('ID','features')
df_kmeans_.show()

Now, to evaluate your K-Means model, draw the elbow plot.


<div class="alert alert-block alert-info">

**Code**: 
    Plot the elbow plot, with a varying K from 2 to 20.

<div class="alert alert-block alert-warning">
    

</div>
</div>



In [None]:
from pyspark.ml.clustering import KMeans

# Draw the elbow plot, varying K from 2 to 20 (inclusive)
K_MAX = 20
cost = np.zeros(K_MAX + 1)
for k in range(2, K_MAX + 1):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    # Fit on a 10% sample to save time, then compute the cost on the full data
    model = kmeans.fit(df_kmeans_.sample(False, 0.1, seed=42))
    cost[k] = model.computeCost(df_kmeans_)

In [None]:
fig, ax = plt.subplots(1,1, figsize =(6,6))
ax.set_title('The Elbow method chart', fontsize=16)
ax.set_xlabel('Number of clusters', fontsize=12)
ax.set_ylabel('Cost', fontsize=12)
ax.set_xticks(np.arange(2, K_MAX + 1, 2))
ax.grid()
ax.plot(range(2, K_MAX + 1), cost[2:K_MAX + 1])
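The cost on the y-axis is the sum of squared distances from each point to its nearest cluster center. A minimal plain-Python sketch of that quantity, on made-up 2-D points (the function name `kmeans_cost` is hypothetical):

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total

# Hypothetical 2-D points forming two tight groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

cost_k1 = kmeans_cost(points, [(5.0, 1.0)])               # one center in the middle
cost_k2 = kmeans_cost(points, [(0.0, 1.0), (10.0, 1.0)])  # one center per group
print(cost_k1, cost_k2)
```

The cost generally keeps falling as K grows, so the elbow is the point where adding another cluster stops producing a large drop.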



### 2.2 K-Means

Can you identify the optimal K value from the elbow plot?




****

<div class="alert alert-block alert-info">

**Code**: 
    Choose a K value as 8 and then summarize each cluster with the count on Position_Group.

</div>




In [None]:
k = 8

# Your code
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans_)
print(type(model))
centers = model.clusterCenters()

print("Cluster Centers: ")
for center in centers:
    print(center)



In [None]:
from pyspark.sql.functions import count

# Assign each player ID to its cluster (the "prediction" column)
modelTransformed = model.transform(df_kmeans_).select('ID', 'prediction')

# Join the predictions to the dataframe that contains Position_Group.
# Joining the Spark DataFrames directly avoids collecting every row to the driver.
df_kmeans_pred_ = modelTransformed.join(df_kmeans_new, 'ID').withColumnRenamed("prediction", "Cluster")
df_kmeans_pred_.groupBy('Position_Group', 'Cluster').agg(count("*")).sort('Position_Group', 'Cluster').show()
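To describe the position profile of each cluster, one option is to look for the position group that contributes the most players to it. The following plain-Python sketch does this on made-up (Position_Group, Cluster) pairs; the names `assignments` and `dominant` are hypothetical and the counts are not results from the data set.

```python
from collections import Counter

# Hypothetical (Position_Group, Cluster) pairs, for illustration only.
assignments = [
    ("DEF", 0), ("DEF", 0), ("MID", 0),
    ("FWD", 1), ("FWD", 1), ("MID", 1),
    ("MID", 2),
]

# Same information as the groupBy/count table: players per (group, cluster).
counts = Counter(assignments)

# For each cluster, keep the position group with the largest count.
dominant = {}
for (group, cluster), n in counts.items():
    if cluster not in dominant or n > counts[(dominant[cluster], cluster)]:
        dominant[cluster] = group

for cluster in sorted(dominant):
    print(cluster, dominant[cluster])
```

A cluster whose counts are split almost evenly between two groups has no clear profile, so the raw counts are still worth reading alongside the dominant label.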


## Part 3 - Supervised Learning: Classification on Position_Group

<a id="classification"></a>
***

In the last part, you used the players' skillset values to segment them into 8 clusters. Now check whether we can accurately predict the Position_Group of a player.

*Use **PySpark**.*


### 3.1 Data Preparation

We remove the Position feature and use all other skillset features plus the cluster prediction as the input for the model. Your target for classification is "Position_Group".

In [None]:
FEATURES_COL_ = ['Height(CM)', 'Weight(KG)', 
                      'Crossing', 'Finishing', 'HeadingAccuracy', 
                      'ShortPassing', 'Volleys', 'Dribbling', 'Curve',
                      'FKAccuracy', 'LongPassing', 'BallControl', 
                      'Acceleration', 'SprintSpeed', 'Agility', 
                      'Reactions', 'Balance', 'ShotPower', 'Jumping', 
                      'Stamina', 'Strength', 'LongShots', 'Aggression', 
                      'Interceptions', 'Positioning', 'Vision', 'Penalties', 
                      'Composure', 'Marking', 'StandingTackle', 'SlidingTackle','Cluster']

vecAssembler_ = VectorAssembler(inputCols=FEATURES_COL_, outputCol="features")
df_class_ = vecAssembler_.transform(df_kmeans_pred_).select('features','Position_Group')
df_class_.show(3)


Feature scaling is important in much data science modeling work, because features measured on larger scales can otherwise dominate the model.
Here, we apply standard scaling to the features.

In [None]:
from pyspark.ml.feature import StandardScaler

standardscaler=StandardScaler().setInputCol("features").setOutputCol("Scaled_features")
raw_data=standardscaler.fit(df_class_).transform(df_class_)
raw_data.select("features","Scaled_features",'Position_Group').show(5)
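With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its standard deviation and does not subtract the mean. A minimal plain-Python sketch of that operation on made-up columns, assuming the sample (n - 1) standard deviation; the function name `scale_to_unit_std` is hypothetical:

```python
import statistics

# Hypothetical feature columns on very different scales.
height_cm = [170.0, 180.0, 190.0]
finishing = [40.0, 60.0, 80.0]

def scale_to_unit_std(values):
    """Divide by the sample standard deviation (no mean centering)."""
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [v / std for v in values]

scaled_height = scale_to_unit_std(height_cm)
scaled_finishing = scale_to_unit_std(finishing)

print(scaled_height)
print(scaled_finishing)
```

After scaling, both columns have a standard deviation of 1, so neither dominates purely because of its units. Passing `withMean=True` to `StandardScaler` would additionally center each feature at zero.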

Spark classifiers cannot use a string column as the target, so please encode the Position_Group column using the following encoding:

FWD = 0
DEF = 1
MID = 2

*Hint: Data type after encoding should be numeric.*

In [None]:
raw_data_ = raw_data.withColumn('Target',when(col("Position_Group") == "DEF", 1)
      .when(col("Position_Group")== "FWD", 0)
      .otherwise(2))

### 3.2 Training Test Evaluation

We now split the data into training and test sets and evaluate one model's performance.

In [None]:
train, test = raw_data_.randomSplit([0.7, 0.3], seed=12)

In [None]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="Target", featuresCol="Scaled_features",maxIter=10)
model = lr.fit(train) #fit model with data

# prediction on test data
predict_test = model.transform(test)
predict_test.select("Target","prediction").show(10)



****


<div class="alert alert-block alert-info">

**Code**: 
    You are required to evaluate the model using a confusion matrix. Please also print out your model's Precision, Recall and F1 score.

</div>

In [None]:
from sklearn.metrics import confusion_matrix

y_true = predict_test.select("Target").toPandas()

y_prediction = predict_test.select("prediction").toPandas()

confusionMatrix = confusion_matrix(y_true, y_prediction)

print(confusionMatrix)

In [None]:
# Overall statistic
from sklearn.metrics import classification_report

classificationReport = classification_report(y_true,y_prediction)
print(classificationReport)
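The per-class numbers in the classification report can be derived directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A minimal plain-Python sketch on a made-up 3-class matrix (the values and the helper `per_class_metrics` are hypothetical, not results from the model above):

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = [
    [50, 5, 5],    # true FWD (0)
    [2, 80, 18],   # true DEF (1)
    [8, 15, 77],   # true MID (2)
]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix.

    Assumes class k occurs at least once among the true labels and the predictions.
    """
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
    actual_k = sum(cm[k])                     # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for k in range(len(cm)):
    print(k, [round(m, 2) for m in per_class_metrics(cm, k)])
print("accuracy:", round(accuracy, 2))
```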

### 3.3 K-fold Cross-Validation

So far we have skipped one step of the modeling work: hyperparameter tuning. We can use K-fold cross-validation to find the best hyperparameter set.

****


**Code**: 
    Implement K-fold cross validation for three (any three) classification models.
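As a reminder of what K-fold cross-validation does: the training data is split into K folds, and each fold serves once as the validation set while the remaining K - 1 folds are used for fitting. The plain-Python sketch below uses a simple deterministic split for illustration; Spark's `CrossValidator` assigns rows to folds randomly, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # k folds of roughly equal size
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(n_rows=10, k=5))
for train, validation in splits:
    print("validate on", validation, "train on", len(train), "rows")
```

The validation score of a hyperparameter setting is the average of its K per-fold scores, which is less sensitive to one lucky or unlucky split than a single train/test evaluation.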

In [None]:
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

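# Evaluator shared by all models. Its default metric is f1, which is what
# CrossValidator uses below to pick the best model.
multi_evaluator = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")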

# -- random forest without tuning ---
rf = RandomForestClassifier(labelCol="Target", featuresCol="Scaled_features")
rf_model = rf.fit(train)
rf_predict_test = rf_model.transform(test)

rf_accuracy = multi_evaluator.evaluate(rf_predict_test, {multi_evaluator.metricName: "accuracy"}) # metricName can also be f1, weightedPrecision, weightedRecall

#Should use this paramGrid, but it is very slow
#paramGrid = (ParamGridBuilder()
#             .addGrid(rf.maxDepth, [5,10,20,25,30])
#             .addGrid(rf.maxBins, [20, 60])
#             .addGrid(rf.numTrees, [5, 20,50,100])
#             .build())

# -- random forest with tuning
paramGrid = (ParamGridBuilder()
               .addGrid(rf.maxDepth, [2, 4, 6])
               .addGrid(rf.maxBins, [20, 60])
               .addGrid(rf.numTrees, [5, 20])
               .build())
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=multi_evaluator, numFolds=5)
cv_model = cv.fit(train)
rf_model_best = cv_model.bestModel

rf_predict_test_best = rf_model_best.transform(test)
rf_accuracy_best = multi_evaluator.evaluate(rf_predict_test_best, {multi_evaluator.metricName: "accuracy"}) 
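The commented-out grid is slow largely because an exhaustive grid search fits one model per parameter combination per fold. A minimal plain-Python sketch of that count for the two random forest grids above (the helper `count_fits` is hypothetical, and the final refit of the best setting on the full training set is not included):

```python
from itertools import product

def count_fits(grid, num_folds):
    """Number of fold-level model fits for an exhaustive grid search."""
    combinations = list(product(*grid.values()))
    return len(combinations) * num_folds

small_grid = {"maxDepth": [2, 4, 6], "maxBins": [20, 60], "numTrees": [5, 20]}
large_grid = {"maxDepth": [5, 10, 20, 25, 30], "maxBins": [20, 60], "numTrees": [5, 20, 50, 100]}

small_fits = count_fits(small_grid, num_folds=5)
large_fits = count_fits(large_grid, num_folds=5)
print(small_fits, large_fits)
```

On top of the larger count, each fit in the larger grid is also more expensive, since deeper trees and more trees take longer to train.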


In [None]:
# print comparison accuracy
print('random forest accuracy:', round(rf_accuracy, 2))
print('random forest best model accuracy:', round(rf_accuracy_best, 2))

# print the parameters of the best model
print('Random Forest best model params:', {param[0].name: param[1] for param in rf_model_best.extractParamMap().items()})

In [None]:
# print comparison confusion matrix

# random forest without tuning 
rf_true = rf_predict_test.select("Target").toPandas()
rf_prediction = rf_predict_test.select("prediction").toPandas()
rf_cm = confusion_matrix(rf_true, rf_prediction)
print('Confusion matrix random forest:\n', rf_cm)

brf_true = rf_predict_test_best.select("Target").toPandas()
brf_prediction = rf_predict_test_best.select("prediction").toPandas()
brf_cm = confusion_matrix(brf_true, brf_prediction)
print('\nConfusion matrix random forest best model:\n', brf_cm)

# using the precision, the recall, and the f1-score to compare the random forest model and the best random forest model
rf_overall = classification_report(rf_true, rf_prediction)
print('\nRandom Forest overall:\n', rf_overall)
brf_overall = classification_report(brf_true, brf_prediction)
print('\nRandom Forest best model overall:\n', brf_overall)

In [None]:
# Decision Trees

# --- Decision tree no tuning ---
dt = DecisionTreeClassifier(labelCol="Target", featuresCol="Scaled_features", maxDepth=3)
dt_model = dt.fit(train)
dt_predict_test = dt_model.transform(test)
dt_accuracy = multi_evaluator.evaluate(dt_predict_test, {multi_evaluator.metricName: "accuracy"}) 

# --- Decision tree with tuning ---
paramGrid = (ParamGridBuilder()
               .addGrid(dt.maxDepth, [2, 4, 6])
               .addGrid(dt.maxBins, [20, 60])
               .build())

# Create 5-fold CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=multi_evaluator, numFolds=5)

# Run cross validations.
cv_model = cv.fit(train)
dt_model_best = cv_model.bestModel
dt_predict_test_best = dt_model_best.transform(test)

dt_accuracy_best = multi_evaluator.evaluate(dt_predict_test_best, {multi_evaluator.metricName: "accuracy"}) 

In [None]:
# print comparison
print('decision tree accuracy:', round(dt_accuracy, 2))
print('decision_tree best model accuracy:', round(dt_accuracy_best, 2))
# print parameters
print("Decision tree best model params:", {param[0].name: param[1] for param in dt_model_best.extractParamMap().items()})

In [None]:
# decision tree confusion matrix
dt_true = dt_predict_test.select("Target").toPandas()
dt_prediction = dt_predict_test.select("prediction").toPandas()
dt_cm = confusion_matrix(dt_true, dt_prediction)
print('Decision tree matrix:\n', dt_cm)

# decision tree with tuning 
bdt_true = dt_predict_test_best.select("Target").toPandas()
bdt_prediction = dt_predict_test_best.select("prediction").toPandas()
bdt_cm = confusion_matrix(bdt_true, bdt_prediction)
print('\nDecision tree best model matrix:\n', bdt_cm)

# using the precision, the recall, and the f1-score to compare the decision trees model and the best decision trees model
dt_overall = classification_report(dt_true, dt_prediction)
print('\nDecision tree overall\n', dt_overall)
bdt_overall = classification_report(bdt_true, bdt_prediction)
print('\nDecision tree best model overall:\n', bdt_overall)

In [None]:
import timeit

# LogisticRegression

# --- LogisticRegression without tuning was already fitted in section 3.2 ---
lr_accuracy = multi_evaluator.evaluate(predict_test, {multi_evaluator.metricName: "accuracy"}) 

# --- LogisticRegression with tuning ---
#paramGrid = (ParamGridBuilder()
#               .addGrid(lr.aggregationDepth,[2,5,10])
#               .addGrid(lr.elasticNetParam,[0.0, 0.5, 1.0])
#               .addGrid(lr.fitIntercept,[False, True])
#               .addGrid(lr.maxIter,[5, 10, 100])
#               .addGrid(lr.regParam,[0.01, 0.5, 2.0])
#               .build())

paramGrid = (ParamGridBuilder()
               .addGrid(lr.maxIter,[5, 10, 100])
               .addGrid(lr.aggregationDepth,[2,5,10])
               .addGrid(lr.regParam,[0.01, 0.5, 2.0])
               .build())

cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=multi_evaluator, numFolds=5)

# Run cross validations.
start = timeit.default_timer()

cv_model = cv.fit(train)

stop = timeit.default_timer()
print('Time: ', stop - start)  

lr_model_best = cv_model.bestModel
lr_predict_test_best = lr_model_best.transform(test)
lr_accuracy_best = multi_evaluator.evaluate(lr_predict_test_best, {multi_evaluator.metricName: "accuracy"}) 

In [None]:
# print comparison
print('logistic regression accuracy:', round(lr_accuracy, 2))
print('logistic regression best model accuracy:', round(lr_accuracy_best, 2))
params = lr_model_best.extractParamMap()
print('Logistic regression best model params', {param.name: value for param, value in params.items()})


In [None]:
# using the confusion matrix to compare the logistic regression model and the best logistic regression model
print('Logistic regression matrix:\n', confusionMatrix)

# LR best model
blr_true = lr_predict_test_best.select("Target").toPandas()
blr_prediction = lr_predict_test_best.select("prediction").toPandas()
blr_cm = confusion_matrix(blr_true, blr_prediction)
print('\nLogistic regression best model matrix:\n', blr_cm)

# using the precision, the recall, and the f1-score to compare the logistic regression model and the best logistic regression model
print('\nLogistic regression overall:\n', classificationReport)
blr_overall = classification_report(blr_true, blr_prediction)
print('\nLogistic regression best model overall:\n', blr_overall)


In [None]:
# print accuracy comparison between models
print('logistic regression accuracy:', round(lr_accuracy, 2))
print('decision tree accuracy:', round(dt_accuracy, 2))
print('random forest accuracy:', round(rf_accuracy, 2))

# print the hyperparameters of the chosen model
# we followed https://www.silect.is/blog/2019/4/2/random-forest-in-spark-ml
# to decide which parameters are worth tuning
params = {param[0].name: param[1] for param in rf_model_best.extractParamMap().items()}
print("Random Forest best model params:", params)
# possible hyperparameters to be tuned: maxDepth, numTrees, maxBins, featureSubsetStrategy, minInfoGain, minInstancesPerNode
# we only display these 3 because they are the only ones used in our paramGrid
print('Best hyper-parameters for this model:')
print('maxDepth:', params['maxDepth'])
print('numTrees:', params['numTrees'])
print('maxBins:', params['maxBins'])