# Getting Started with Splice Machine

## If you have any questions about the platform, or if you would like to schedule a demo with a member of our team, you can do so using the chatbot at the bottom right corner of the landing page. 

-----

# Analytics and Machine Learning powered by Spark
#### Splice Machine's platform is built around Spark. Begin by initializing a Spark Session, as well as our custom pysplice context.
The pysplice context allows you to create a Spark Dataframe using our Native Spark DataSource. The Native Spark DataSource allows you to create a Spark dataframe directly and instantly. No need to stream data using a database driver. 

In [None]:
#Begin spark session 
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#create pysplice context. Allows you to create a Spark dataframe using our Native Spark DataSource 
from splicemachine.spark import PySpliceContext
splice = PySpliceContext(spark)

-----

# Add data directly to the Splice Machine Database
#### Data can also be added to the database using the 'Data Import' tab on the landing page, through a Pandas/Spark dataframe, or through SQL, as shown below.
Documentation about importing data into the database can be found [here](https://doc.splicemachine.com/bestpractices_ingest_import.html)

In [None]:
%%sql 
--This %% notation is a Jupyter magic command. It specifies that this cell is SQL, rather than the default Python.
--With Splice Machine, you can also specify a cell to use R, Scala, Java, Groovy and more, all in the same notebook!

-- Drop the sql table if it exists, and create a new one. 
drop table if exists cc_fraud_data; 
create table cc_fraud_data (
    time_offset integer,
    expected_weekly_trans_cnt double,
    expected_weekly_trans_amnt double,
    expected_daily_trans_cnt double,
    expected_daily_trans_amnt double,
    weekly_trans_cnt double,
    weekly_trans_amnt double,
    daily_trans_cnt double,
    daily_trans_amnt double,
    rolling_avg_weekly_trans_cnt double,
    rolling_avg_weekly_trans_amnt double,
    rolling_avg_daily_trans_cnt double,
    rolling_avg_daily_trans_amnt double,
    MACD_trans_amnt double,
    MACD_trans_cnt double,
    RSI_trans_amnt double,
    RSI_trans_cnt double,
    Aroon_trans_amnt double,
    Aroon_trans_cnt double,
    ADX_trans_amnt double,
    ADX_trans_cnt double,
    current_balance double,
    rolling_avg_balance double,
    MACD_balance double,
    Aroon_balance double,
    RSI_balance double,
    ADX_balance double,
    credit_score double,
    credit_limit double,
    amount decimal(10,2),
    class_result int
);

-- insert data directly into this table from s3
call SYSCS_UTIL.IMPORT_DATA (
     null,
     'cc_fraud_data',
     null,
     's3a://splice-demo/kaggle-fraud-data/creditcard.csv',
     ',',
     null,
     null,
     null,
     null,
     -1,
     's3a://splice-demo/kaggle-fraud-data/bad',
     null, 
     null);

##### Use ANSI SQL to query the table we just created
You can find documentation about our database [here](https://doc.splicemachine.com/sqlref_intro.html)

In [None]:
%%sql 
SELECT 
    class_result, 
    AVG(expected_weekly_trans_cnt) as avg_expected_weekly_trans_cnt, 
    AVG(MACD_trans_amnt) as avg_MACD_trans_amnt, 
    AVG(RSI_trans_amnt) as avg_RSI_trans_amnt
from cc_fraud_data
group by class_result
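
To see the shape of the result this query returns (one row per `class_result` value), the same statement can be tried locally. The sketch below is a stand-in, not Splice Machine code: it assumes Python's built-in `sqlite3` module and three made-up rows rather than the imported data, so the averages are illustrative only.

```python
import sqlite3

# Local in-memory stand-in for the database (assumption: sqlite3, not Splice Machine).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cc_fraud_data ("
    "class_result INTEGER, expected_weekly_trans_cnt REAL, "
    "MACD_trans_amnt REAL, RSI_trans_amnt REAL)"
)

# Made-up rows for illustration only: two with class 0, one with class 1.
rows = [
    (0, 10.0, 0.5, 40.0),
    (0, 14.0, 1.5, 60.0),
    (1, 30.0, -2.0, 80.0),
]
conn.executemany("INSERT INTO cc_fraud_data VALUES (?, ?, ?, ?)", rows)

# Same aggregate as the cell above; ORDER BY is added so the row order is stable.
summary = conn.execute(
    """
    SELECT class_result,
           AVG(expected_weekly_trans_cnt) AS avg_expected_weekly_trans_cnt,
           AVG(MACD_trans_amnt) AS avg_MACD_trans_amnt,
           AVG(RSI_trans_amnt) AS avg_RSI_trans_amnt
    FROM cc_fraud_data
    GROUP BY class_result
    ORDER BY class_result
    """
).fetchall()

for row in summary:
    print(row)
```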

## Create a Spark dataframe using the Native Spark DataSource
#### The Native Spark DataSource creates a Spark dataframe directly from the database. There is no need to stream the data from the database to the Jupyter notebook, and back again. This is much faster because there is no serialization or deserialization. Data is accessed directly.

In [None]:
sql_query = "SELECT * FROM cc_fraud_data"
df = splice.df(sql_query)

#### Take a look at your dataframe

In [None]:
import pandas as pd
df.limit(5).toPandas()

## Splice Machine has embedded the Spark UI directly into our notebooks, allowing you to easily track your job status

#### Analyze your dataframe using Spark

In [None]:
from pyspark.sql.functions import mean

df.groupBy("CLASS_RESULT").mean("CREDIT_SCORE").toPandas()

------

# Machine Learning with Integrated, Enhanced MLflow
#### What is MLflow?
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. Splice Machine has embedded MLflow directly into our platform and made several unique enhancements, further simplifying the ML lifecycle.

#### Take a look at the MLflow UI
If you haven't yet run any models, your UI will be mostly empty. Once models are run, you will see the results logged in this UI.

In addition to being embedded in the notebook, the MLflow UI can be viewed in a separate tab.

You can view our MLflow documentation [here](https://doc.splicemachine.com/mlmanager_api.html)

In [None]:
from splicemachine.mlflow_support import *
mlflow.register_splice_context(splice)

from splicemachine.notebook import get_mlflow_ui
get_mlflow_ui()

#### Create a new MLflow experiment
An experiment is an object that allows you to store multiple runs, or iterations, of your model training process.

In [None]:
from splicemachine.mlflow_support import *
mlflow.register_splice_context(splice)

mlflow.set_experiment('Splice Machine Demo')
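
As a mental model of the relationship between an experiment and its runs, the sketch below uses plain Python dataclasses. It is a toy illustration, not MLflow's or Splice Machine's implementation: the `Run` and `Experiment` classes, the `best_run` helper, the second run, and the `f1` values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One iteration of model training (toy stand-in, not an MLflow class)."""
    name: str
    params: Dict[str, str] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Experiment:
    """A named container that groups related runs (toy stand-in)."""
    name: str
    runs: List[Run] = field(default_factory=list)

    def best_run(self, metric: str) -> Run:
        # Compare only the runs that actually recorded the requested metric.
        scored = [run for run in self.runs if metric in run.metrics]
        return max(scored, key=lambda run: run.metrics[metric])


experiment = Experiment("Splice Machine Demo")
# The metric values below are made up for illustration.
experiment.runs.append(
    Run("sparkml run 1", params={"spark executors": "10"}, metrics={"f1": 0.81})
)
experiment.runs.append(
    Run("sparkml run 2", params={"spark executors": "20"}, metrics={"f1": 0.86})
)

best = experiment.best_run("f1")
print(best.name)
```

Grouping runs under one experiment is what makes this kind of side-by-side comparison possible in the MLflow UI.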

#### Train a model, tracking your steps using MLflow
Any type of model can be tracked using MLflow. In this example, we are creating a simple SparkML Random Forest Classifier, trained on a subset of the data we loaded above.

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from splicemachine.stats import SpliceMultiClassificationEvaluator
from datetime import datetime

# Create our dataset
df_subset = df.select('WEEKLY_TRANS_CNT','CURRENT_BALANCE','CREDIT_SCORE','AMOUNT','CLASS_RESULT' )
train, test = df_subset.randomSplit([0.8,0.2])

#log all aspects of the model training process as a single model run
with mlflow.start_run(run_name='sparkml run 1'):
        
    #Create MLflow tags. (any information about the modeling process we would like stored)
    mlflow.set_tag('teammates', 'carol, daniel')
    mlflow.lp('spark executors', '10')
    
    # Set our feature vector to be all columns except the label
    va = VectorAssembler(inputCols = train.columns[:-1], outputCol='features')
    rf = RandomForestClassifier(labelCol='CLASS_RESULT', featuresCol='features')
    pipeline = Pipeline(stages=[va,rf])
    mlflow.log_feature_transformations(pipeline) #log what columns are used in the model 
    
    #train the model
    trainedModel = pipeline.fit(train)
    mlflow.log_model(trainedModel, 'spark_model')# Log our model for deployment or future use
    
    #Calculate predictions on test data
    predictions = trainedModel.transform(test)
    
    #Evaluate model performance metrics
    ev = SpliceMultiClassificationEvaluator(spark,labelCol='CLASS_RESULT')
    ev.input(predictions)
    mlflow.log_metrics(ev.get_results(as_dict = True)) #log the metrics you calculate 
    
    #log the Jupyter notebook itself to MLflow
    mlflow.log_artifact("Getting Started with Splice Machine.ipynb") 
    
    #Save the run ID for deployment; the MLflow run ends when this with block exits
    run_id = mlflow.current_run_id()
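
The feature list in the cell above is taken positionally with `train.columns[:-1]`, which assumes the label is the last column of `df_subset`. The plain-Python sketch below (no Spark required) shows how that assumption can break if the select order changes, along with a name-based alternative. The `reordered` list is a hypothetical column order used only for illustration.

```python
# Column order as selected into df_subset in the cell above.
columns = ["WEEKLY_TRANS_CNT", "CURRENT_BALANCE", "CREDIT_SCORE", "AMOUNT", "CLASS_RESULT"]
label_col = "CLASS_RESULT"

# Positional slicing: correct only while the label stays in the last position.
positional = columns[:-1]

# Name-based filtering: independent of where the label sits.
by_name = [col for col in columns if col != label_col]

# Hypothetical reordering with the label first: slicing now keeps the label
# as a feature and drops a real feature instead.
reordered = ["CLASS_RESULT", "WEEKLY_TRANS_CNT", "CURRENT_BALANCE", "CREDIT_SCORE", "AMOUNT"]
leaky = reordered[:-1]
safe = [col for col in reordered if col != label_col]

print(positional == by_name)  # both approaches agree for the original order
print(label_col in leaky)     # the label leaks into the features after reordering
print(label_col in safe)      # the name-based list never contains the label
```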


#### Return to the MLflow UI, and see the model we have just trained

In [None]:
get_mlflow_ui()

-----

# Database Model Deployment
#### Deploy the Machine Learning model you have created as a table in the database. When new data is inserted into this table, a prediction is automatically generated and stored in that same database table. 

In the example below, we are deploying our model as a table called ```{schema}.spark_model```. The model itself is specified by run_id, the ID of the MLflow run in which ```mlflow.log_model(trainedModel, 'spark_model')``` logged the model in the code above.

You can find documentation about Database Model Deployment here

In [None]:
#use your username as your database schema
from splicemachine.mlflow_support.utilities import get_user
schema = get_user()

In [None]:
#Drop the database table if it already exists 
splice._dropTableIfExists(f'{schema}.spark_model')

#Create a new table called {schema}.spark_model. Deploy the model specified by run_id into that table. 
jid = mlflow.deploy_db(f'{schema}', 'spark_model', run_id, primary_key={'MOMENT_KEY': 'INT'}, df=df_subset, model_cols=df_subset.columns[:-1], create_model_table=True, classes= ['No Fraud', 'Fraud'])

#View logs of deployment process
mlflow.watch_job(jid)

#### Insert two rows of data into the table we created above. Select from that table to see the output
Only the features and a primary key are inserted into the table. Based on the features, the table automatically generates and stores the prediction of the Machine Learning model. Metadata, such as who ran the model and when, is also stored in the table.

In [None]:
%%sql
insert into spark_model (WEEKLY_TRANS_CNT,CURRENT_BALANCE,CREDIT_SCORE,AMOUNT, moment_key) values (-5.1, 3.5, 1.4, 0.2, 1);
insert into spark_model (WEEKLY_TRANS_CNT,CURRENT_BALANCE,CREDIT_SCORE,AMOUNT, moment_key) values (-5.1, 3.8, 100, 300, 2);
select * from spark_model
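
The behavior shown above, where inserting features causes a prediction to be stored in the same row, can be imitated locally to build intuition. The sketch below is an analogy only and says nothing about how Splice Machine implements deployment: it assumes Python's built-in `sqlite3` module, an `AFTER INSERT` trigger, and a hypothetical `toy_predict` rule (amounts above 250 are labeled 'Fraud') in place of the trained model.

```python
import sqlite3

CLASSES = ["No Fraud", "Fraud"]


def toy_predict(weekly_trans_cnt, current_balance, credit_score, amount):
    """Hypothetical stand-in for the trained model: flags large amounts."""
    return CLASSES[1] if amount > 250 else CLASSES[0]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL so the trigger can call it.
conn.create_function("toy_predict", 4, toy_predict)

conn.executescript(
    """
    CREATE TABLE spark_model (
        moment_key INTEGER PRIMARY KEY,
        weekly_trans_cnt REAL,
        current_balance REAL,
        credit_score REAL,
        amount REAL,
        prediction TEXT
    );

    -- After each insert, fill in the prediction column for the new row.
    CREATE TRIGGER score_new_row AFTER INSERT ON spark_model
    BEGIN
        UPDATE spark_model
        SET prediction = toy_predict(NEW.weekly_trans_cnt, NEW.current_balance,
                                     NEW.credit_score, NEW.amount)
        WHERE moment_key = NEW.moment_key;
    END;
    """
)

# Same feature values as the two inserts above; only features and key are supplied.
rows = [(-5.1, 3.5, 1.4, 0.2, 1), (-5.1, 3.8, 100, 300, 2)]
conn.executemany(
    "INSERT INTO spark_model "
    "(weekly_trans_cnt, current_balance, credit_score, amount, moment_key) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

scored = conn.execute(
    "SELECT moment_key, prediction FROM spark_model ORDER BY moment_key"
).fetchall()
print(scored)
```

The labels produced here come from the toy rule, so they should not be read as what the deployed Random Forest model would return for these rows.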

#### Insert two additional rows of data into the table
Each new row must have a unique primary key.

In [None]:
%%sql
insert into spark_model (WEEKLY_TRANS_CNT,CURRENT_BALANCE,CREDIT_SCORE,AMOUNT, moment_key) values (-4.1, 3.9, 1.4, -3, 3);
insert into spark_model (WEEKLY_TRANS_CNT,CURRENT_BALANCE,CREDIT_SCORE,AMOUNT, moment_key) values (8.5, 114, 1120, 300, 4);
select * from spark_model
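
The primary key requirement can be seen in a small local experiment. The sketch below assumes Python's built-in `sqlite3` module as a stand-in; the exact error raised by Splice Machine for a duplicate key will differ, and the two-column table is a simplified, hypothetical version of the deployed table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table: a primary key plus one feature column.
conn.execute("CREATE TABLE spark_model (moment_key INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO spark_model (moment_key, amount) VALUES (3, -3)")

try:
    # Reusing moment_key 3 violates the primary key constraint.
    conn.execute("INSERT INTO spark_model (moment_key, amount) VALUES (3, 300)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A new key is accepted as usual.
conn.execute("INSERT INTO spark_model (moment_key, amount) VALUES (4, 300)")
row_count = conn.execute("SELECT COUNT(*) FROM spark_model").fetchone()[0]

print(duplicate_rejected, row_count)
```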

#### End your Spark Session

In [None]:
spark.stop()

-----

#### For more in-depth demos on how to use Splice Machine, including our Feature Store, automatic model re-training, and BeakerX integration, explore our other Jupyter Notebooks. 