Welcome to Tibyan AI's **Launch Into Machine Learning** training workshop. This Jupyter Notebook is an interactive companion to the slides provided during the lectures.

# SESSION 1: DATA-CENTRIC AI
After completing these exercises, at the end of Day 1, you should be able to:

* Load tabular data into a `pandas` dataframe
* Manipulate and organize the data via `pandas` dataframe filtering and indexing
* Visualize the data in `seaborn`

References: [Titanic on Kaggle](https://www.kaggle.com/competitions/titanic)

In [None]:
### Import libraries for data exploration and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Exercise 1.1: Collect & Visualize Data

### Loading Titanic data

We will load a dataset about who survived the Titanic disaster.

In [None]:
# Run a bash command in Jupyter Notebooks with "!" to see the raw csv file
!head titanic_data/train.csv



Most of our analysis will be done in `pandas`. The [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function is full-featured and very useful.

In [None]:
train_data = pd.read_csv('titanic_data/train.csv')
print(f"The Dataframe (matrix) has {train_data.shape[0]} rows "
  + f"and {train_data.shape[1]} columns.\n")
print("Here's what some rows of data look like:")
display(train_data.head())


### Exploring data
We've read our comma-separated value (CSV) file into a `pandas` [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). In this case, `train_data` is a Dataframe that's shaped like a matrix. In `pandas` Dataframes, you can refer to things by their column (or row) headings as well as by index in the matrix.

In [None]:
# Indexing & Selection. e.g., for row 2 of the 'Ticket' column (column index 8)
print(f"Referring by index, row 2 column 8:\n{train_data.iloc[2, 8]}\n")
print(f"Referring by 'Ticket' column, row 2:\n{train_data.loc[2, 'Ticket']}\n")

print(f"Selecting a column by heading 'Ticket':\n{train_data['Ticket']}\n")

print("Selecting a row by the value in 'Ticket':")
print(train_data.loc[train_data['Ticket'] == 'STON/O2. 3101282'])
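
The difference between position-based and label-based selection is easier to see on a tiny table. The sketch below uses a hypothetical three-row dataframe (not the Titanic data) whose row labels deliberately do not start at 0, so `iloc` (position) and `loc` (label) need different keys to reach the same cell.

```python
import pandas as pd

# Hypothetical toy table standing in for the Titanic data
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Ticket": ["A1", "B2", "C3"]},
    index=[10, 11, 12],
)

by_position = toy.iloc[2, 1]              # third row, second column (counted from 0)
by_label = toy.loc[12, "Ticket"]          # row labelled 12, column labelled 'Ticket'
matches = toy.loc[toy["Ticket"] == "B2"]  # a boolean mask keeps only matching rows

print(by_position, by_label)
print(matches)
```

In `train_data` the row labels happen to equal the row positions, which is why both styles return the same ticket there.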

BONUS: If all we're doing is visualizing, Google Colab has nice tools for data exploration. We'll load the "Interactive Table" feature in Google Colab to explore `pandas` Dataframes very quickly.

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()
train_data

### Plotting a histogram with Seaborn
Often, we'd like to visualize the data with specific plots. Let's try to plot the relationship between `Survived` and `Age` to see if older or younger people are more likely to survive, using a kind of bar chart called a histogram (see Seaborn documentation on [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html)).

In [None]:
# Plot the original Age distribution
sns.histplot(data=train_data, x="Age", binwidth=10)

In [None]:
# Plot the Age distribution with survivor data overlaid
sns.histplot(data=train_data, x="Age", binwidth=10,
             stat="probability", multiple="fill", hue="Survived")

### Assignment 1.1
Often, we want to programmatically calculate specific statistics. Here, let's explore the relationship between the `Survived` and `Sex` variables in the dataset. Write code to calculate and print the following:

* What percentage of the passengers survived the Titanic?
* What percentage of the survivors were women? (Select survivors first)
* What percentage of women were survivors? (Select women first)
* Visualize the relationship between these two variables using a stacked histogram.
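
The order in which you select rows changes the question you are answering. Here is a minimal sketch on a hypothetical eight-row table (made-up values, not the real dataset) showing one possible way to compute each kind of percentage with boolean masks; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same two column names as the Titanic file
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0, 0, 0],
    "Sex": ["female", "male", "female", "male",
            "male", "female", "female", "male"],
})

# The mean of a 0/1 column is the fraction of ones
pct_survived = toy["Survived"].mean() * 100

# Select survivors first, then ask what share of them are women
survivors = toy[toy["Survived"] == 1]
pct_survivors_women = (survivors["Sex"] == "female").mean() * 100

# Select women first, then ask what share of them survived
women = toy[toy["Sex"] == "female"]
pct_women_survived = women["Survived"].mean() * 100

print(f"{pct_survived:.1f}% survived")
print(f"{pct_survivors_women:.1f}% of survivors were women")
print(f"{pct_women_survived:.1f}% of women survived")
```

The last two numbers differ even on this toy table, which is the point of asking both questions.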

In [None]:
### Exercise 1.1(a)
### What percentage of the passengers survived the Titanic?

pct_survived = None  # <YOUR CODE HERE>
print(f"{pct_survived}% of passengers survived the Titanic.")

### What percentage of the survivors were women? (Select survivors first)

pct_survivors_women = None  # <YOUR CODE HERE>
print(f"{pct_survivors_women}% of survivors were women.")

### What percentage of women were survivors? (Select women first)

pct_women_survived = None  # <YOUR CODE HERE>
print(f"{pct_women_survived}% of women were survivors.")

### Visualize the relationship b/t `Survived` and `Sex` via a stacked histogram

# <YOUR CODE HERE>
print("This plot breaks down who 'Survived' according to 'Sex'")


<!-- ### EXTRA: Flood disaster datasets
Look at some data sets on flooding. Say that you would like to ...

* Which of the [OpenFEMA](https://www.fema.gov/about/openfema/data-sets) data sets would you use if you were trying to find...
* Access your chosen dataset [via API](https://www.fema.gov/about/openfema/api). Namely, use the `requests` library to send `https` commands and get results according to the API.
* Pull a small amount of data to see its characteristics. -->


## Exercise 1.2: Dataset cleaning
In real life, data does not come in exactly the format you need for the problem you want to solve. You have to "wrangle" the data -- clean and transform it into something you can use. Let's try looking at some cryptocurrency (financial) data. (Adapted from Jongho Kim's introductory Jupyter Notebooks [1](https://github.com/jonghkim/financial-time-series-prediction-v2/blob/master/Module3/Hands-on-Labs/Lab_Orderbook_Data_Exploration.ipynb), [2](https://github.com/jonghkim/financial-time-series-prediction-v2/blob/master/Module2/Hands-on-Labs/Lab_Preprocessing_for_Cryptocurrency_Data.ipynb))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

### Loading cryptocurrency data
We'll look at some data that is a bit more raw than the previous Titanic data. `coin_price` is financial data for several cryptocurrencies.


In [None]:
# Pull data into pandas
coin_price = pd.read_csv("/content/crypto_data/coin_price_dfs.csv")
print(f"This dataframe is {coin_price.shape[0]} rows " \
      + f"and {coin_price.shape[1]} columns.")

orderbook = pd.read_csv("/content/crypto_data/orderbook_dfs.csv")
print(f"This dataframe is {orderbook.shape[0]} rows " \
      + f"and {orderbook.shape[1]} columns.")

coin_price.head()

### Assignment 1.2(a) Sorting by Time
For both `coin_price` and `orderbook`:

* There are 3 variables used to record time. Which among the three is inaccurate? (Hint: use `keys` to see headers; use `fromtimestamp` and `fromisoformat` from the [datetime](https://docs.python.org/3/library/datetime.html) library to convert between date formats).
* Keep only one accurate column for time with the lowest storage cost. (Hint: use `drop` from [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)).
* Sort the data according to time. (Hint: use `sort_values` from [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)).
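
Before touching the real files, it may help to see the conversion and cleanup steps on a toy table. The sketch below assumes two hypothetical time columns, `timestamp` (Unix seconds) and `iso_time` (ISO strings); the real headers come from `coin_price.keys()` and may differ. It treats both columns as UTC, which is also an assumption.

```python
from datetime import datetime, timezone
import pandas as pd

# Hypothetical columns and values, not taken from the workshop data
toy = pd.DataFrame({
    "timestamp": [1600000120, 1600000000, 1600000060],
    "iso_time": ["2020-09-13T12:28:40", "2020-09-13T12:26:40",
                 "2020-09-13T12:27:40"],
    "last": [3.0, 1.0, 2.0],
})

# Convert both representations to timezone-aware datetimes so they can be compared
from_unix = [datetime.fromtimestamp(t, tz=timezone.utc)
             for t in toy["timestamp"].tolist()]
from_iso = [datetime.fromisoformat(s).replace(tzinfo=timezone.utc)
            for s in toy["iso_time"].tolist()]
columns_agree = from_unix == from_iso

# Keep the compact integer column, drop the string column, and sort by time
tidy = (toy.drop(columns=["iso_time"])
           .sort_values("timestamp")
           .reset_index(drop=True))
print(columns_agree)
print(tidy)
```

If two columns disagree after conversion, comparing each against a third source is one way to decide which one is inaccurate.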



In [None]:
time_label = ''
# <YOUR CODE HERE>

print(f"I am keeping the {time_label} column as my time")

### Plotting financial data
After selecting rows only pertaining to Bitcoin, plot the `last` recorded price of the cryptocurrency at the end of the time frame.

In [None]:
# Select rows pertaining to currency_types of interest only
currency_types = ["btc"]
coin_price = coin_price[coin_price["currency"].isin(currency_types)]

g = sns.lineplot(
    x=coin_price[time_label].apply(datetime.fromtimestamp),
    y=coin_price['last'])
plt.xticks(rotation=45)
plt.show()

### Assignment 1.2(b) Calculations and Missing values
Data is not pristine in real life. Data collection may produce missing or incomplete data, even in highly automated settings.

* How many missing values (`NaN`) are present in `coin_price`?  (Hint: use `isna` from [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html))

For `coin_price`, we calculate each row's percentage change in `last` from the previous datapoint (a measure commonly used for stocks) and save it as a new `last_return` column. We then take its log and save that as `log_last_return`.

* Since this is a time series, fill in the `NaN`s in the `log_last_return` column by using `interpolate` from [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). This estimates the value of a missing datapoint based on the values around it.
* What happens if you do not fill in `NaN`s (or `dropna`, for example)?
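
As a quick illustration of what counting and interpolating look like, here is a sketch on a hypothetical five-point series (made-up values). It relies on the default behavior of `interpolate`, which is linear.

```python
import numpy as np
import pandas as pd

# Hypothetical log returns with two gaps
toy = pd.DataFrame({"log_last_return": [0.01, np.nan, 0.03, np.nan, 0.05]})

# isna() gives a True/False table; summing twice counts NaNs over all columns
n_missing = int(toy.isna().sum().sum())

# Linear interpolation fills each gap from its neighbors
filled = toy["log_last_return"].interpolate()

print(n_missing)
print(filled.tolist())
```

Note that interpolation needs a known value on each side of a gap, so a `NaN` in the very first row is not filled this way.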

In [None]:
# <YOUR CODE HERE>
n_missing = None  # set this to your NaN count from above
print(f"There are {n_missing} NaNs in coin_price")

# Calculate the % change from the previous datapoint, then move to log scale
coin_price['last_return'] = coin_price['last'].pct_change()
coin_price['log_last_return'] = np.log(1 + coin_price['last_return'])

# <YOUR CODE HERE>
print(f"After interpolation: {coin_price.isna().sum().sum()} NaNs remaining")

### EXTRA: Normalizing and plotting serial correlation
Serial correlation is the relationship between a given variable and a lagged version of itself over various time intervals. In finance, this correlation is used by technical analysts to determine how well the past price of a security predicts the future price. Using the normalized `series` variable that we calculate below:

* Plot `series` (the normalized `log_last_return`) using `lag_plot` from the `pandas.plotting` library.

In [None]:
# coin_price = coin_price.interpolate() # ANSWER FROM ABOVE
# Normalize to zero mean and unit standard deviation (z-score)
series = (coin_price["log_last_return"] - coin_price["log_last_return"].mean())\
  / coin_price["log_last_return"].std()

In [None]:
pd.plotting.lag_plot(series)
plt.show()
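
A lag plot shows serial correlation visually; `Series.autocorr` gives the same idea as a single number. The sketch below uses two synthetic series (random noise and a straight line, both made up for illustration) to show what weak and strong lag-1 correlation look like numerically.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=500))        # no memory of past values
trend = pd.Series(np.arange(500, dtype=float)) # fully determined by the past

# z-score normalization: subtract the mean, divide by the standard deviation
z = (noise - noise.mean()) / noise.std()

lag1_noise = z.autocorr(lag=1)      # expected to be close to 0
lag1_trend = trend.autocorr(lag=1)  # expected to be close to 1

print(f"noise: {lag1_noise:.3f}, trend: {lag1_trend:.3f}")
```

A lag plot of the noise series would look like a shapeless cloud, while the trend series would fall on a diagonal line.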


# SESSION 2: ANNOTATION AND EVALUATION
By the end of Session 2, you should be able to:

* Annotate gold standard data for an ML task
* Identify the components of an ML annotation project
* Split data (a `dataframe`) for use in an ML project
* Run a baseline ML algorithm on gold standard data


## Exercise 2.1: Annotation
Here's a [list](https://www.simonwenkel.com/lists/software/list-of-annotation-tools-for-machine-learning-research.html) of open-source annotation software, most of it tailored to a specific data type. A few tools, like [Label Studio](https://labelstud.io/guide/get_started.html) and [Universal Data Tool](https://universaldatatool.com/), cover the typical use cases.

In this Exercise, you will use Label Studio to set up, annotate, and export an annotation project.

We'll look at a subset of the MNIST dataset, a collection of images of handwritten digits that is a standard tutorial tool for machine learning classification models. While we will use the dataset for ML algorithms later, the main goal right now is to understand what goes into creating ML datasets.

### Assignment 2.1(a): Install Label Studio
Label Studio is a great tool, but it needs to be installed locally on your machine.

* Follow the Label Studio installation guide [here](https://labelstud.io/guide/install) to install and start Label Studio. (For reference, we have tried installing it via pip, which requires Python, and via Docker.) You are done when you have a browser open to http://localhost:8080/, welcoming you to Label Studio.

### Assignment 2.1(b): Set up annotation project
* Go to `Projects` -> `Create`
  * Project Name: `MNIST_100 annotation`
  * Data Import: `Upload Files` and select all images in `mnist_100` after you have downloaded them locally from [here](https://drive.google.com/drive/folders/1s3l7Tx1OrT3sl1eA_cpoUcZpnlerocRj?usp=drive_link). You should get a confirmation "100 files uploaded."
  * Labeling setup: `Computer Vision` -> `Image Classification` (should display a stock photo of airplanes)
  * Save. You should get to a page where you are looking at rows of data.
* Click `Settings`
  * Select `General` -> `Task Sampling` -> `Random Sampling`
  * Select `Labeling Interface` -> `Browse Templates` and `Visual`.
  * Under `Add Choices` type the numbers 1-9 and 0, one per line, click `Add`. Click the red "x" next to "Adult content" and "Weapons" and "Violence". (The "UI Preview" should show `1[1]` and `2[2]` and `3[3]`, etc under the sample image.)
  * Click `Save`.

### Assignment 2.1(c): Label & Export data
* Click on the Project Name at the top of the page to return to the rows of data.
* Click on the `Label All Tasks` button.
  * For each image, type or select the correct number, then click `Submit`. Do this for all 100 photos.
* Click back on the Project Name, then select `Export` -> `JSON`. You should download a file like `project-1-at-nnnn.json`
* Close Label Studio once you have validated that you have a real JSON file.


### Assignment 2.1(d): Import data to Colab

* Upload the JSON file to Colab's `Files`.
* Upload the [`mnist_100` folder](https://drive.google.com/drive/folders/1s3l7Tx1OrT3sl1eA_cpoUcZpnlerocRj?usp=drive_link) to Colab.


### Extra: Check your gold standard annotations
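Now that we have your gold standard labels on our data, let's load the exported file with Python's built-in `json` module and inspect it. ([`pandas.read_json()`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) is an alternative once the structure is flat enough to fit a table.)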



In [None]:
import json

# Rename your exported file (e.g., project-1-at-nnnn.json) to match, or edit this
filename = 'project-1.json'

# Use the `json` module to bring the JSON data in
with open(filename, 'r') as json_file:
  raw_data = json.load(json_file)

# Look at the 0th row (i.e., data for one image)
display(raw_data[0])

It is common for ML data such as these annotations to be buried in lots of messy places in a big JSON object.

* Iterate through all of our images and pull out the pieces of information that we actually care about, such as `'choices'` and `'file_upload'`. (Check out [Working with Large Nested JSON Data](https://ankushkunwar7777.medium.com/get-data-from-large-nested-json-file-cf1146aa8c9e).)

In [None]:
# Iterate through the JSON object and look for things of interest
def extract_values(obj, key):
    """Pull all values of the specified key from nested JSON."""
    arr = []

    def extract(obj, arr, key):
        """Recursively search for values of key in the JSON tree."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == key:
                    # Record the match even when the value is a list or dict
                    arr.append(v)
                elif isinstance(v, (dict, list)):
                    extract(v, arr, key)
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, key)
        return arr

    return extract(obj, arr, key)

# Pull the fields we care about from each task. `result` is a list, so
# index its first entry before reading `id` or `value`.
outcome = []
for row in raw_data:
    first_result = row['annotations'][0]['result'][0]
    outcome.append(
        {'id': first_result['id'],
         'choice': first_result['value']['choices'],
         'file': row['file_upload']})

# Collect the extracted fields into a table, one row per image
mnist_data = pd.DataFrame(outcome)
display(mnist_data.head())
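
As an alternative to a hand-written loop, `pandas.json_normalize` can flatten nested records in one call. The sketch below runs on two hypothetical tasks shaped like the fields accessed above (`annotations`, `result`, `value`, `choices`, `file_upload`); a real Label Studio export has many more fields and may be structured differently, so treat this as an illustration only.

```python
import pandas as pd

# Hypothetical records that mimic only the fields used in the loop above
toy_tasks = [
    {"id": 1, "file_upload": "a.png",
     "annotations": [{"result": [{"id": "r1", "value": {"choices": ["7"]}}]}]},
    {"id": 2, "file_upload": "b.png",
     "annotations": [{"result": [{"id": "r2", "value": {"choices": ["3"]}}]}]},
]

# record_path walks annotations -> result; meta carries file_upload along
flat = pd.json_normalize(
    toy_tasks,
    record_path=["annotations", "result"],
    meta=["file_upload"],
)

# Nested keys are joined with '.', so the choices land in 'value.choices'
flat["label"] = flat["value.choices"].apply(lambda choices: int(choices[0]))
print(flat[["id", "label", "file_upload"]])
```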

## Exercise 2.2: Evaluation
Before we even put a new ML model into place, we can evaluate how well a _baseline model_ works in the same setting. Our baseline model doesn't need to be complicated; this time we'll prepare all the data and use a simple algorithm called a Decision Tree (you'll learn more about it in Session 3).

The Titanic data has a pre-defined train-test split -- `test.csv` being separate from `train.csv`. (In real life, we would need to create this split ourselves.) We will only evaluate `test_data` at the very end.


In [None]:
# Load the training data
train_data = pd.read_csv('titanic_data/train.csv')

# Load held-out test data now, but don't use it until the end
test_data = pd.read_csv('titanic_data/test.csv')

def print_relative_size_of_datasets(df1, df2):
  num_train_samples = df1.shape[0]
  num_test_samples = df2.shape[0]
  tot_samples = num_train_samples + num_test_samples
  print(f"The first dataset has {num_train_samples} samples "\
        + f"while the second dataset has {num_test_samples}; a ")
  print(f"{num_train_samples/tot_samples*100:.2f}"\
        + f"/{num_test_samples/tot_samples*100:.2f}"\
        + f" split of the data.")

print_relative_size_of_datasets(train_data, test_data)
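
When a dataset does not come pre-split, one common option is `train_test_split` from `sklearn`. Here is a minimal sketch on a hypothetical 100-row table (made-up values); the 80/20 ratio and the fixed `random_state` are illustrative choices, not a recommendation from the workshop.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table
toy = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Hold out 20% of the rows; random_state makes the shuffle repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))
```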

Let's do a little quick-and-dirty processing to get rid of `NaNs` (it'd be best to consider each variable one at a time) and throw away some columns that probably won't help in the classification (they're completely unique for each passenger).

In [None]:
# Forward-fill to eliminate NaNs (fillna(method='ffill') is deprecated in recent pandas)
train_data = train_data.ffill()
test_data = test_data.ffill()

# Drop some variables that probably won't be useful
train_data = train_data.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test_data = test_data.drop(['Name', 'Ticket', 'Cabin'], axis=1)

display(train_data.head())
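
Considering each variable one at a time could look like the sketch below: a numeric column filled with its median and a categorical column filled with its most frequent value. The table is hypothetical and the choice of median and mode is an assumption for illustration; other strategies may suit the real columns better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill gaps with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill gaps with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```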

### Data Splitting
Next, we'll split the training dataset again, into a training set and a development/validation set. (Remember that we need a validation set so that we can set _hyperparameters_ before running an algorithm on the test data.)

We'll use the classic ML library `sklearn` for a utility to help us do this.

In [None]:
from sklearn.model_selection import ShuffleSplit

# Get indices for a split
split = ShuffleSplit(n_splits = 1, test_size = 0.2)
# Iterate through (only 1) split, setting train/val data
# split() yields positional indices, so select rows with .iloc
for train_indices, val_indices in split.split(train_data):
    train_set = train_data.iloc[train_indices]
    val_set = train_data.iloc[val_indices]

display(train_set.head())

### Assignment 2.2(a): Stratified sampling
If, for example, your training data is 70% men and 30% women, but your test data is 80% women and 20% men, your ML model may not perform well on the test data. This is called _sampling bias_. To help with this problem, we can try _stratifying_ the data according to variables of interest (e.g., `Sex`). This ensures both training and validation have similar distributions of the 'Survived', 'Pclass', and 'Sex' features for unbiased model evaluation.

* Write an alternative split, stratifying the split according to 'Survived', 'Pclass', and 'Sex'. Save the output as `strat_train_set` and `strat_val_set`. (Hint: pass in the columns you want to stratify via `y` in `StratifiedShuffleSplit`'s [`split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split).)
* Verify that the percent of samples in each 'Pclass' value and `Sex` value are the same in `strat_train_set` and `strat_val_set`. (Hint: use Dataframe's `value_counts()` method.)

In [None]:
# <YOUR CODE HERE>

##### SOLUTION
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2)
# split() yields positional indices, so select rows with .iloc
for train_indices, val_indices in split.split(
    train_data, train_data[["Survived" , "Pclass" , "Sex"]]):
    strat_train_set = train_data.iloc[train_indices]
    strat_val_set = train_data.iloc[val_indices]

# Compare the percentage of each 'Pclass' and 'Sex' value across the two sets
for column in ["Pclass", "Sex"]:
    display(strat_train_set[column].value_counts(normalize=True) * 100)
    display(strat_val_set[column].value_counts(normalize=True) * 100)

### Establishing a baseline
Besides splitting the data, we need a few more steps of data preparation before our machine learning algorithm can work.

1. We can get rid of columns that are unlikely to contribute to the predictions: see `.drop()` below.
2. We need to transform string data into categorical/numerical values (for the way some machine learning algorithms are optimized): see `pd.get_dummies()` below.

We'll cover how to do this more reliably in Session 3; for now, here are quick and dirty ways to do these.


In [None]:
X_train = pd.get_dummies(strat_train_set.drop(["Survived"], axis=1))
y_train = strat_train_set["Survived"]

X_val = pd.get_dummies(strat_val_set.drop(["Survived"], axis=1))
y_val = strat_val_set["Survived"]

display(X_train.head())
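
One caveat with calling `pd.get_dummies()` separately on each split: if a category appears in one split but not the other, the two tables end up with different columns. The sketch below reproduces this on hypothetical toy data and shows one possible fix, reindexing the validation columns to match the training columns.

```python
import pandas as pd

# Hypothetical splits: the validation rows only contain the 'S' category
toy_train = pd.DataFrame({"Fare": [7.0, 8.0, 9.0], "Embarked": ["S", "C", "Q"]})
toy_val = pd.DataFrame({"Fare": [10.0, 11.0], "Embarked": ["S", "S"]})

X_toy_train = pd.get_dummies(toy_train)
raw_val_columns = pd.get_dummies(toy_val).shape[1]  # fewer columns than training

# Add the missing dummy columns (filled with 0) and match the training order
X_toy_val = pd.get_dummies(toy_val).reindex(
    columns=X_toy_train.columns, fill_value=0)

print(list(X_toy_train.columns))
print(list(X_toy_val.columns))
```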

We'll use a basic _decision tree_ (we'll cover this in Session 3) as our baseline to classify between `Survived=0` and `Survived=1`.

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)

clf.fit(X_train, y_train)
acc = clf.score(X_val, y_val)

print(f"Your {clf.__class__.__name__} predicts 'Survived'")
print(f" with a validation set accuracy of {acc*100:.2f}%")

Congratulations! You've trained and run your first ML algorithm!

### Assignment 2.2(b): Evaluation practice
Now that we have a working ML classifier, let's look at the evaluation environment.

* ML algorithms often give different answers even with the same parameters. Write a loop that trains 5 classifiers and averages their scores. (Hint: vary or remove `random_state`.)
* ML algorithms have lots of options. Today, we're not focusing on what those options mean, but on how to test between them. Write a loop or other function that tests out the hyperparameters (options) for [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier): `criterion`, `splitter`, and `max_features`. Which values for each option give the best (averaged over 5) results?
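
One possible shape for such a loop is sketched below on synthetic data from `make_classification` rather than the Titanic splits, so it runs on its own. The helper name `mean_val_accuracy` is hypothetical, and only `criterion` is varied here; the other options can be explored the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / X_val / y_train / y_val
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

def mean_val_accuracy(n_runs=5, **params):
    """Train n_runs trees with different seeds and average validation accuracy."""
    scores = []
    for seed in range(n_runs):
        clf = DecisionTreeClassifier(random_state=seed, **params)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_va, y_va))
    return float(np.mean(scores))

# Compare two values of one hyperparameter by their averaged scores
results = {crit: mean_val_accuracy(criterion=crit) for crit in ("gini", "entropy")}
best_criterion = max(results, key=results.get)
print(results, best_criterion)
```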

### Extra: Cross-validation
Cross-validation on a training set can be very helpful for finding the best values. In `sklearn` you can do this with less code.

Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) and/or [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) to find the best values for hyperparameters.
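
For orientation, a grid search over the three options named in Assignment 2.2(b) might look like the sketch below. It runs on synthetic data from `make_classification` so it is self-contained; the candidate values in `param_grid` and the 5-fold setting are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination of these values is tried: 2 * 2 * 3 = 12 candidates
param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_features": [None, "sqrt", "log2"],
}

# cv=5 scores each candidate with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"{search.best_score_:.3f}")
```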

# SESSION 3: REGRESSION AND CLASSIFICATION
After completing these exercises, you should be able to:
- Implement data pipelines for streamlined ML-related preprocessing.
- Scale features with `StandardScaler`.
- Train a supervised machine learning algorithm with hyperparameter tuning.
- Make predictions and save results.

## Exercise 3.1: An ML Training Pipeline
In this exercise, we'll continue using the simple Titanic dataset to do basic *classification*, but we'll use the `sklearn` package's ecosystem of "estimators," "transformers," and pipelines to make the task robust.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
data = pd.read_csv('titanic_data/train.csv')
test_data = pd.read_csv('titanic_data/test.csv')

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2)
# split() yields positional indices, so select rows with .iloc
for train_indices, val_indices in split.split(data, data[["Survived" , "Pclass" , "Sex"]]):
    strat_train_set = data.iloc[train_indices]
    strat_val_set = data.iloc[val_indices]
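
To preview how the imported pieces can fit together, here is a minimal sketch of a preprocessing-plus-model pipeline on a hypothetical eight-row table. It uses `ColumnTransformer` from `sklearn.compose`, which is not in the import list above, so treat that as this sketch's own choice rather than the workshop's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with the same column names as the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, 35.0, np.nan, 28.0, 50.0],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
})

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column to the preprocessing that suits its type
preprocess = ColumnTransformer([
    ("num", numeric, ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Chain preprocessing and the classifier so fit/predict run every step in order
model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
model.fit(toy[["Age", "Sex"]], toy["Survived"])
preds = model.predict(toy[["Age", "Sex"]])
print(preds)
```

Because the imputer and scaler live inside the pipeline, they are fit on the training rows only and then reused unchanged on validation or test rows.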