# **The Analytics Continuum**
![analytics-continuum](https://onedrive.live.com/embed?resid=8022b983441d23f2%2187918&authkey=%21AMOEYwhzclSMD7s&width=1246&height=674)

# Understand Data Science

#### Data science lets you extract insights and knowledge from complex datasets. Most commonly, data scientists use these datasets to train a machine learning model, which you can then use to generate predictions on new data.

#### Before learning how to train a model, explore the types of models you can train and how a typical data science process works.

# Explore common machine learning models
#### The purpose of machine learning is to train models that can identify patterns in large amounts of data. You can then use those patterns to make predictions that give you new insights you can act on.

#### The possibilities with machine learning may appear endless, so let's begin by understanding the three common types of machine learning models:

![Machine learning tasks](https://learn.microsoft.com/en-us/training/wwl/get-started-data-science-fabric/media/machine-learning-tasks.png)

#### **Classification:** Predict a categorical value like whether a customer may churn.
#### **Regression:** Predict a numerical value like the price of a product.
#### **Forecasting:** Predict future numerical values based on time-series data like the expected sales for the coming month.

#### To decide which type of machine learning model you need to train, you first need to understand the business problem and the data available to you.
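
The three task types differ mainly in what the target looks like. The following is a minimal sketch on synthetic data, assuming scikit-learn and NumPy are available; the arrays, the sales figures, and the three-month-mean forecast baseline are all made up for illustration and are not part of the churn dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features

# Classification: the target is a category (here 0 = stays, 1 = churns).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
class_pred = clf.predict(X[:5])

# Regression: the target is a continuous number (for example, a price).
y_reg = 10.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
reg_pred = reg.predict(X[:5])

# Forecasting: the target is a future value of a time-ordered series.
# A naive baseline predicts next month as the mean of the last three months.
monthly_sales = np.array([100.0, 110.0, 105.0, 120.0, 125.0, 130.0])
next_month_forecast = monthly_sales[-3:].mean()

print("classification:", class_pred)
print("regression:", reg_pred.round(2))
print("forecast:", next_month_forecast)
```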

# **Understanding the Data Science Process**
#### Training a machine learning model commonly involves the following steps:
![Data Science Process](https://learn.microsoft.com/en-us/training/wwl/get-started-data-science-fabric/media/data-science-process.png)


##### **1. Define the problem:** Together with business users and analysts, decide what the model should predict and when it counts as successful.
##### **2. Get the data:** Find data sources and get access by storing your data in a Lakehouse.
##### **3. Prepare the data:** Explore the data by reading it from a Lakehouse into a notebook. Clean and transform the data based on the model's requirements.
##### **4. Train the model:** Choose an algorithm and hyperparameter values through trial and error, tracking your experiments with MLflow.
##### **5. Generate predictions:** Use model batch scoring to generate the requested predictions.
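
The last two steps can be sketched with plain scikit-learn and pandas. This is a minimal illustration on synthetic data, not Fabric's own batch scoring feature; the column names `feature_a`, `feature_b`, `label`, and `prediction` are assumptions invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
feature_names = ["feature_a", "feature_b"]  # hypothetical column names

# Step 4 (train): fit a classifier on synthetic, already-prepared data.
train = pd.DataFrame(rng.normal(size=(300, 2)), columns=feature_names)
train["label"] = (train["feature_a"] - train["feature_b"] > 0).astype(int)
model = LogisticRegression().fit(train[feature_names], train["label"])

# Step 5 (generate predictions): score a whole batch of new rows in one call
# and store the predictions next to the inputs.
new_batch = pd.DataFrame(rng.normal(size=(10, 2)), columns=feature_names)
scored = new_batch.copy()
scored["prediction"] = model.predict(new_batch[feature_names])
print(scored.head())
```

Scoring the batch in a single `predict` call, rather than row by row, keeps the predictions aligned with the input rows and is usually much faster.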

## Define the Problem

In [None]:
# Record the notebook start time
import time
ts = time.time()

# Set the experiment name; otherwise the notebook name is used
import mlflow
mlflow.set_experiment("cust_churn")

In [None]:
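import pyspark.sql.functions as F

# Query the source table with Spark SQL (the table name after FROM is left blank here),
# then convert the result to a pandas DataFrame for scikit-learn
df = spark.sql("SELECT * FROM ")
df = df.toPandas()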

In [None]:
df.info()

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split

print("Splitting data...")
# Feature columns used to predict the churn label
feature_cols = ['years_with_company', 'total_day_calls', 'total_eve_calls', 'total_night_calls',
                'total_intl_calls', 'average_call_minutes', 'total_customer_service_calls', 'age']
X, y = df[feature_cols].values, df['churn'].values

# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
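
Churn labels can be imbalanced, with far fewer churners than non-churners. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The sketch below uses synthetic labels with an assumed 20% churn rate purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 20% positive (churn) labels.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([1] * 200 + [0] * 800)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Both splits keep roughly the original 20% churn rate.
print(f"train churn rate: {y_tr.mean():.2f}, test churn rate: {y_te.mean():.2f}")
```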

In [None]:
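from sklearn.linear_model import LogisticRegression

with mlflow.start_run():
    # Autologging records parameters, metrics, and the fitted model for this run
    mlflow.autolog()

    # C is the inverse of the regularization strength, so 1/0.1 gives C=10
    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)

    # Log the estimator name so runs can be compared by model type later
    mlflow.log_param("estimator", "LogisticRegression")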

In [None]:
from sklearn.tree import DecisionTreeClassifier
   
with mlflow.start_run():
    mlflow.autolog()

    model = DecisionTreeClassifier().fit(X_train, y_train)
   
    mlflow.log_param("estimator", "DecisionTreeClassifier")
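
The runs above rely on autologged training metrics, which are computed on the training data. A complementary check is to score each model on the held-out test split. The sketch below shows the idea on synthetic data generated with `make_classification`; the variable names are illustrative, and the numbers it prints say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the churn table.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(C=1/0.1, solver="liblinear"),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    # Compare accuracy on data the model has seen with data it has not.
    results[name] = {
        "train_accuracy": accuracy_score(y_tr, estimator.predict(X_tr)),
        "test_accuracy": accuracy_score(y_te, estimator.predict(X_te)),
    }
    print(name, results[name])
```

An unconstrained decision tree usually fits its training data almost perfectly, so a large gap between train and test accuracy points to overfitting rather than to a better model.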

In [None]:
import mlflow
experiments = mlflow.search_experiments()
for exp in experiments:
    print(exp.name)

In [None]:
experiment_name = "cust_churn"
exp = mlflow.get_experiment_by_name(experiment_name)
print(exp)

In [None]:
mlflow.search_runs(exp.experiment_id)

In [None]:
mlflow.search_runs(exp.experiment_id, order_by=["start_time DESC"], max_results=2)

In [None]:
import matplotlib.pyplot as plt

# Compare the two most recent runs by their autologged training accuracy
df_results = mlflow.search_runs(exp.experiment_id, order_by=["start_time DESC"], max_results=2)[["metrics.training_accuracy_score", "params.estimator"]]

fig, ax = plt.subplots()
ax.bar(df_results["params.estimator"], df_results["metrics.training_accuracy_score"])
ax.set_xlabel("Estimator")
ax.set_ylabel("Training accuracy")
ax.set_title("Training Accuracy by Estimator")
# Annotate each bar with its rounded value
for i, v in enumerate(df_results["metrics.training_accuracy_score"]):
    ax.text(i, v, str(round(v, 2)), ha='center', va='bottom', fontweight='bold')
plt.show()
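
`mlflow.search_runs` returns a pandas DataFrame by default, so ordinary pandas operations can pick out the best run. The sketch below uses a hand-built DataFrame with the same two columns as above; the accuracy values are placeholders invented for the example, not results from this notebook.

```python
import pandas as pd

# Placeholder stand-in for the output of mlflow.search_runs(...);
# the values are made up for illustration.
runs = pd.DataFrame({
    "params.estimator": ["DecisionTreeClassifier", "LogisticRegression"],
    "metrics.training_accuracy_score": [0.90, 0.80],
})

# Sort so the highest-scoring run comes first, then take that row.
ranked = runs.sort_values("metrics.training_accuracy_score", ascending=False)
best = ranked.iloc[0]
print(f"Best estimator: {best['params.estimator']}")
```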

In [None]:
SEED = 1234 # Random seed
input_df = spark.read.format("delta").load("Tables/ChurnData")