Welcome to Synnax Lab! In this tutorial, I'll guide you through the entire process—from setting up your account to making your first submission. Let's go! 🚀

# Creating Account

## Step 1: Navigate to Synnax Lab

Let's go to [Synnax Lab](https://synnax.app/) and click on the **LOG IN** button.


![Synnax](https://github.com/synnax-ai/synnax-lab-sdk/blob/update_manual/samples/img/step_1.png?raw=true)

## Step 2: Choosing Your Login Method

Synnax allows you to log in with Coinbase and MetaMask. For this tutorial, I'll be using MetaMask.

![Login Options](img/step_2.png)


## Step 3: Connect Your MetaMask Account

A pop-up window will appear, prompting you to connect your MetaMask account. Follow the instructions to connect, and then click **Next** to proceed.

![Connect MetaMask](img/step_3.png)

## Step 4: Provide Necessary Permissions

You will be prompted to provide the necessary permissions. Review the permissions request and click **Confirm** to proceed.

![Provide Permissions](img/step_4.png)

## Step 5: Confirm Signature Request

You will then be prompted with a signature request from your wallet. Review the request and confirm it to proceed.

![image](img/step_5.png)

## Step 6: Get Approval from the Synnax Team

Click on your profile icon in the top right corner and select **Settings** from the dropdown menu.

![Profile and Settings](https://www.imghippo.com/filedownload/bSnug1725463001.png)

![Settings](img/step_6_1.png)

Then copy your **Name**. By default it is set to your wallet ID; you can change it later to your actual name or nickname.

![Name](img/step_6_2.png)

### Send your name as a Telegram direct message to Danil Zherebtsov (@danil_com)

Approval usually arrives very quickly. Once you have it, proceed to the next step.

## Step 7: Create a New API Key

In the **API Keys** section, click on **New API Key** to generate a new key.

![API Keys](img/step_7.png)


## Step 8: Name Your Key

Give your API key a name, tick every checkbox, and click **Create** to finalize the process.

![Name and Create API Key](img/step_8.png)


## Step 9: Copy Your API Key

Once your API key is created, **copy** it immediately. This key will not be available after closing the modal.

![Copy API Key](img/step_9.png)

🙏 You have successfully created your account and obtained an API key for Synnax Lab.  

💪 Now, roll up your sleeves, get your hands dirty, and make some magic happen!

# Data Science Meets Finance: Let’s Get Coding!

## Installing SDK

To make your journey from fetching data to submitting predictions as smooth as possible, Synnax Lab has created an SDK that's as effortless as a Sunday morning. Let’s install it and get started!

In [11]:
! pip install synnax_lab_sdk -q

## Imports

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.multioutput import MultiOutputRegressor
from lightgbm import LGBMRegressor

# Importing Synnax Lab SDK Client
from synnax_lab_sdk.client import SynnaxLabClient

## Fetching Datasets

The script below creates a `synnax-data` folder in the current working directory, where all the downloaded datasets are stored.
The returned `files` object is a dictionary that maps file names to their respective paths.

In [None]:
synnax_lab_client = SynnaxLabClient(api_key = "your_api_key")

files = synnax_lab_client.get_datasets()

In [5]:
files

{'x_train_path': 'synnax-data/datasets/X_train.csv',
 'targets_train_path': 'synnax-data/datasets/targets_train.csv',
 'x_forward_looking_path': 'synnax-data/datasets/X_forward_looking.csv',
 'macro_train_path': 'synnax-data/datasets/macro_train.csv',
 'macro_forward_looking_path': 'synnax-data/datasets/macro_forward_looking.csv',
 'sample_submission_path': 'synnax-data/datasets/sample_submission.csv',
 'data_dictionary_path': 'synnax-data/datasets/data_dictionary.txt',
 'dataset_date': '2024-09-10'}

## Dataset Structure 📂

<pre>
📂 synnax-data
│   └── 📂 datasets
│       ├── 📜 data_dictionary.txt
│       ├── 📊 macro_forward_looking.csv
│       ├── 📈 macro_train.csv
│       ├── 📝 sample_submission.csv
│       ├── 🎯 targets_train.csv
│       ├── 🔮 X_forward_looking.csv
│       └── 📚 X_train.csv
</pre>

### 📋 Description:
The datasets subdirectory includes everything you need:

- **`X_train.csv`**: Your training data with financial features.
- **`targets_train.csv`**: The targets you're predicting in training.
- **`X_forward_looking.csv`**: The test data where you’ll make your predictions.
- **`macro_train.csv`** & **`macro_forward_looking.csv`**: Macroeconomic data to enrich your model.
- **`sample_submission.csv`**: Shows you how to format your predictions for submission.



## Loading Dataset

In [150]:
X_train = pd.read_csv(files['x_train_path'])  # Training features
X_forward_looking = pd.read_csv(files['x_forward_looking_path'])  # Test features
targets_train = pd.read_csv(files['targets_train_path'])  # Training targets
# macro_train = pd.read_csv(files['macro_train_path'])  # Historical macroeconomic data
# macro_forward_looking = pd.read_csv(files['macro_forward_looking_path'])  # Future macroeconomic data

# Clean column names in macroeconomic datasets by removing unsupported characters (mainly spaces)
# Note: uncommenting the two lines below also requires `import re`
# macro_train = macro_train.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
# macro_forward_looking = macro_forward_looking.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
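
The commented-out column cleanup above can be tried in isolation first. The snippet below is a minimal sketch on a toy DataFrame; the column names are made up for illustration and are not taken from the real macroeconomic files.

```python
import re

import pandas as pd

# Toy stand-in for a macroeconomic table; these column names are hypothetical.
macro_demo = pd.DataFrame({"GDP growth (%)": [2.1, 1.8], "CPI, YoY": [3.4, 3.1]})

# Keep only letters, digits, and underscores in every column name.
macro_demo = macro_demo.rename(
    columns=lambda name: re.sub(r"[^A-Za-z0-9_]+", "", name)
)

print(list(macro_demo.columns))
```

Names that differ only in the removed characters would collide after cleaning, so it is worth checking that the cleaned names are still unique.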

Let's take a quick look at the first few rows of our training data.

In [106]:
X_train.head()

Unnamed: 0,companyId,metadata_0,industry,sector,metadata_1,metadata_2,metadata_3,metadata_4,lastUpdatedAnnumEndDate,lastUpdatedQuarterEndDate,...,Y_0_feature_122,Y_0_feature_95,Y_0_feature_40,Y_0_feature_56,Y_0_feature_54,Y_0_feature_101,Y_0_feature_99,Y_0_feature_124,Y_0_feature_128,Y_0_feature_43
0,company_21230,TH,Pollution & Treatment Controls,Industrials,0.50005,,0.50235,1448.3148,2023-12-31,2024-06-30,...,101.58625,1038.36685,655.429,2.8003,89.5904,0.5,0.5,265.91925,1398.6361,52.52565
1,company_352,US,Biotechnology,Healthcare,,0.4999,4.77695,7216.3456,2023-12-31,2024-06-30,...,10.95,5813.1,1794.7356,0.5,522.1,0.5,0.5,16951.75,5813.1,-11138.1
2,company_3853,CN,Airports & Air Services,Industrials,0.50015,0.50015,47.32115,88190.89465,2023-12-31,2024-06-30,...,20768.33075,78092.83165,14408.75425,50.11065,17663.72665,-375.16215,0.5,19524.86935,106229.494,36702.64605
3,company_20796,TH,Resorts & Casinos,Consumer Cyclical,0.50005,,3.15225,10077.5209,2023-12-31,2024-06-30,...,9796.72115,23419.375,5230.67575,0.5,1747.9649,-41.68705,2018.1635,4246.88795,43695.68365,-4591.63235
4,company_9742,US,Electronic Components,Technology,0.50175,0.5012,6.8843,335875.7512,2023-12-31,2024-06-30,...,42953.5,124008.95,1246.13905,10099.7,13353.05,-4234.3,701.5,50616.75,167605.7,89133.6


## Checking Categorical Columns

In [151]:
cat_cols = X_train.select_dtypes(include='object').columns
cat_cols

Index(['companyId', 'metadata_0', 'industry', 'sector',
       'lastUpdatedAnnumEndDate', 'lastUpdatedQuarterEndDate'],
      dtype='object')

Don't miss the two columns containing dates! 📅

In [153]:
# Check the number of unique values in the categorical columns
X_train[cat_cols[1:]].nunique()

metadata_0                    79
industry                     143
sector                        11
lastUpdatedAnnumEndDate       19
lastUpdatedQuarterEndDate      5
dtype: int64

# Process data

In [154]:
# Combine training and test data for consistent encoding
data = pd.concat([X_train, X_forward_looking], axis=0)

## Extract features from dates

In [155]:
# Extract features out of datetime cols
datetime_cols = ['lastUpdatedAnnumEndDate', 'lastUpdatedQuarterEndDate']

for col in datetime_cols:
    data[col] = pd.to_datetime(data[col])
    data[f'{col}_day'] = data[col].dt.day
    data[f'{col}_day_of_week'] = data[col].dt.day_of_week
    data[f'{col}_day_of_year'] = data[col].dt.day_of_year
    data[f'{col}_month'] = data[col].dt.month
    data[f'{col}_is_month_start'] = data[col].dt.is_month_start
    data[f'{col}_is_month_end'] = data[col].dt.is_month_end
    data[f'{col}_quarter'] = data[col].dt.quarter
    data[f'{col}_is_quarter_start'] = data[col].dt.is_quarter_start
    data[f'{col}_is_quarter_end'] = data[col].dt.is_quarter_end
    data[f'{col}_year'] = data[col].dt.year

data.drop(datetime_cols, axis=1, inplace=True)

cat_cols = [col for col in cat_cols if col not in datetime_cols + ['companyId']]

## Encode categorical variables

In [156]:
# Encode categorical columns using LabelEncoder
for col in cat_cols:
    data[col] = LabelEncoder().fit_transform(data[col])

There are multiple ways to encode categorical variables, and the right choice depends on the model you use and the nature of the data.

Tree-based models can work with almost any kind of encoding, while for linear models `LabelEncoder` might not be a viable option because it imposes an arbitrary order on the categories.

Try implementing:
- one-hot-encoding
- label-encoding
- frequency-encoding
- mean-target-encoding

to find the best option for your model.
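
As a starting point for the list above, here is a minimal sketch of two of the alternatives, frequency encoding and one-hot encoding, using only pandas. The toy values are made up; only the `sector` column name mirrors the dataset.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
demo = pd.DataFrame(
    {"sector": ["Technology", "Healthcare", "Technology", "Industrials"]}
)

# Frequency encoding: replace each category with its share of rows.
freq = demo["sector"].map(demo["sector"].value_counts(normalize=True))

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(demo["sector"], prefix="sector")

print(freq.tolist())
print(one_hot.columns.tolist())
```

One-hot encoding a column like `industry`, with well over a hundred categories, adds that many columns, which is one reason to compare several encodings rather than pick one up front.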

## Handling Missing Values

### Define Function

In [157]:
def fill_missing_with_mean(df):
    """
    Fill missing values with the mean of each column.
    """
    for col in df.columns:
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].mean())
    return df

### Apply Function to Datasets

In [158]:
data = fill_missing_with_mean(data)


In [159]:
# Drop columns where every value is missing (this sometimes happens)
for col in data:
    if data[col].isnull().all():
        print(col)
        data.drop(col, axis=1, inplace=True)

metadata_8
city


Missing-value imputation is a tricky process. While replacing NaNs with the column mean technically makes your dataset ready for modeling, it might populate your *ground truth* training data with a lot of faulty values. In many cases the column mean will not be the right choice.

Consider:
- median
- mode
- KNN imputer
- verstack.NaNImputer
- etc.
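
A minimal sketch of the first alternative, median imputation, is shown below. Unlike the pipeline above, it learns the fill values from the training rows only and then reuses them for the test rows; that is a design choice of this sketch, not something the tutorial requires. The toy frames and the `feature_a` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for X_train and X_forward_looking.
train_demo = pd.DataFrame({"feature_a": [1.0, np.nan, 3.0, 100.0]})
test_demo = pd.DataFrame({"feature_a": [np.nan, 5.0]})

# Learn the fill values from the training rows only...
medians = train_demo.median(numeric_only=True)

# ...then apply the same values to both sets.
train_filled = train_demo.fillna(medians)
test_filled = test_demo.fillna(medians)

print(train_filled["feature_a"].tolist())
print(test_filled["feature_a"].tolist())
```

The median is less sensitive than the mean to extreme values such as the `100.0` in this toy column.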

In [160]:
# Split data back into training and test sets
X_train = data[:X_train.shape[0]]
X_forward_looking = data[X_train.shape[0]:]

## Dropping `companyId`

In [161]:
X_train = X_train.drop('companyId', axis=1)
X_forward_looking = X_forward_looking.drop('companyId', axis=1)
targets_train = targets_train.drop('companyId', axis=1)
# macro_train = macro_train.drop('companyId', axis=1)
# macro_forward_looking = macro_forward_looking.drop('companyId', axis=1)

## Training

### Example Hyperparameters

In [162]:
# Define starter parameters for LGBMRegressor
params = {
    'learning_rate': 0.01,
    'num_leaves': 250,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.9,  # note: LightGBM applies bagging only when bagging_freq > 0
    'verbosity': -1,
    'random_state': 42,
    'device_type': 'cpu',
    'objective': 'regression',
    'metric': 'l2',
    'num_threads': 10,
    'lambda_l1': 0.5,
    'n_estimators': 100,
}

### Train the Model

We will use the same type of model with fixed parameters to predict each of the 17 targets. Instead of writing code to train 17 independent models, we can let `sklearn.multioutput.MultiOutputRegressor` do it for us in a convenient sklearn-style one-liner.

But remember: even though all targets are financial indicators for the same company and time period, each one may have different dependencies and may even require a different type of model. So try different strategies to improve your score.
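
To make the trade-off concrete, the sketch below fits the wrapper and an explicit per-target loop on synthetic data and checks that they agree. It uses `Ridge` instead of LightGBM only to keep the example light; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# Synthetic features and three synthetic targets.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ rng.normal(size=(4, 3))

# Convenience route: one wrapper, the same estimator for every target.
wrapped = MultiOutputRegressor(Ridge()).fit(X_demo, y_demo)

# Manual route: one model per target, so each target could later get
# its own estimator or hyperparameters.
per_target = [Ridge().fit(X_demo, y_demo[:, i]) for i in range(y_demo.shape[1])]
manual_pred = np.column_stack([model.predict(X_demo) for model in per_target])

print(np.allclose(wrapped.predict(X_demo), manual_pred))
```

The manual loop is the natural place to start when you want a different model or parameter set for individual targets.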

In [163]:
# Initialize the model with parameters
regressor = MultiOutputRegressor(LGBMRegressor(**params))
# from sklearn.linear_model import Ridge
# regressor = MultiOutputRegressor(Ridge())

# Fit the model on training data
regressor.fit(X_train, targets_train)

### Make Predictions

In [164]:
predictions = regressor.predict(X_forward_looking)

## Submitting Predictions

### Load the Sample Submission File

In [165]:
sample_submission = pd.read_csv(files['sample_submission_path'])

### Update the Submission File with Predictions

In [168]:
for col in sample_submission.columns[1:]:
    sample_submission[col] = sample_submission[col].astype(float)

sample_submission.iloc[:, 1:] = predictions
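
Before saving, a few sanity checks can catch shape or NaN problems early. This is a sketch on toy data: the `target_0` and `target_1` column names are hypothetical, and whether the backend enforces these conditions is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for sample_submission and predictions.
sample_demo = pd.DataFrame(
    {"companyId": ["company_1", "company_2"], "target_0": [0, 0], "target_1": [0, 0]}
)
predictions_demo = np.array([[1.5, 2.5], [3.5, 4.5]])

# The prediction matrix should match the target columns of the template.
assert predictions_demo.shape == sample_demo.iloc[:, 1:].shape

submission_demo = sample_demo.copy()
submission_demo[submission_demo.columns[1:]] = predictions_demo

# A submission should not contain missing or infinite values.
assert np.isfinite(submission_demo.iloc[:, 1:].to_numpy(dtype=float)).all()

print(submission_demo)
```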

### Save Submission File

In [169]:
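# Save the updated submission file
import os

submission_path = 'synnax-submissions/submission.csv'
os.makedirs(os.path.dirname(submission_path), exist_ok=True)  # create the folder if it does not exist yet
sample_submission.to_csv(submission_path, index=False)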

### Submit the Predictions

In [None]:
synnax_lab_client.submit_predictions(files["dataset_date"], submission_path)

## Check confidence score (validation)

Scores are calculated on the synnax-lab-sdk backend with a small lag, so if the function below does not return a `confidenceScore` right away, give it a few seconds and rerun it. The `confidenceScore` is available once the submission's `status` is `'Processed'`.

In [147]:
synnax_lab_client.get_past_submissions()

[{'id': '4a809ed2-a34b-409a-895b-85d913817a9d',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'Processed',
  'confidenceScore': -1030.0046503478281,
  'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',
  'uploadedAt': '2024-09-10T11:17:30.540Z'},
 {'id': 'ba10218d-4293-49de-9c56-36d7eb0c0657',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'Processed',
  'confidenceScore': -1030.0046503478281,
  'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',
  'uploadedAt': '2024-09-10T11:16:48.016Z'},
 {'id': '4140f050-eb0e-4e25-bf99-d787a26fa367',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'Processed',
  'confidenceScore': 0.12233956654794893,
  'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',
  'uploadedAt': '2024-09-10T10:47:02.448Z'},
 {'id': 'dbfa0841-31bc-4580-bb4c-c647d1e9f628',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'P
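
The "wait and rerun" advice can also be automated. The helper below is a hypothetical sketch, not part of the SDK: `wait_for_score` and its parameters are made-up names, and it assumes the newest submission is listed first, as in the output above.

```python
import time


def wait_for_score(fetch_submissions, retries=10, delay_seconds=5.0):
    """Poll until the first listed submission is processed.

    Returns its confidenceScore, or None if it is still not processed
    after the given number of retries.
    """
    for _ in range(retries):
        submissions = fetch_submissions()
        if submissions and submissions[0].get("status") == "Processed":
            return submissions[0].get("confidenceScore")
        time.sleep(delay_seconds)
    return None
```

With the client from this tutorial it could be called as `wait_for_score(synnax_lab_client.get_past_submissions)`, passing the method itself rather than its result.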

## Congratulations! 🥳 You’ve Made Your First Submission!

This was a basic pipeline. Now, it's time to level up! 🚀 Use the macroeconomic data to make your model more robust. Experiment with different models, tweak them, and maybe even try some neural networks. 🧠💥

Data science is like cooking. There are endless recipes to try. So, spice things up, preprocess like a pro, and get those scores soaring! 🌟👨‍🍳👩‍🍳

Good luck, and may the data be ever in your favor! 🍀📈

# P.S.

The pipeline above is a simple example of how to get started with synnax-lab-sdk and arrive at your first submission.

To improve your scores, look into:
1. Macroeconomic data
2. Other categorical encoding options (try individual mean-target encoding for each target)
3. Outlier handling
4. More advanced missing-value imputation options
5. Different models, individual models for each target, and hyperparameter tuning
6. Model ensembling (if using different models, make sure you have appropriate processing for each model)
7. Feature selection: not all features in X_train and macro_train are useful, so try removing some of them
8. And of course your creativity. We're confident that your individual approach can beat anything we have laid out in our short tutorial.
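
As one possible take on the outlier item in the list above, the sketch below clips a feature to chosen percentiles. The toy values and the 1st/99th percentile cut-offs are arbitrary assumptions, not recommendations from the Synnax team.

```python
import pandas as pd

# Toy feature: 99 ordinary values plus one extreme value.
values = pd.Series(list(range(1, 100)) + [10000], dtype=float)

# Clip to the 1st and 99th percentiles (cut-offs chosen for illustration).
lower, upper = values.quantile(0.01), values.quantile(0.99)
clipped = values.clip(lower=lower, upper=upper)

print(values.max(), clipped.max())
```

If you clip, compute the bounds on the training data and apply the same bounds to the forward-looking data so both sets are treated consistently.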