Welcome to Synnax Lab! In this tutorial, I'll guide you through the entire process—from setting up your account to making your first submission. Let's go! 🚀

# Creating Account

## Step 1: Navigate to Synnax App

Let's go to [Synnax App](https://synnax.app/) and click on the **LOG IN** button.

![Synnax App image](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_1.png)

## Step 2: Sign up for Synnax App

Synnax lets you sign up or log in with either a wallet or an email address. Choose one option and complete the sign-up process. This registers you in the Synnax App, which is currently open to everyone and lets registered users view credit intelligence for the companies Synnax features. To become a contributor, proceed to Step 3.

![Login Options](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_2.png)


## Step 3: Become a contributor

Apply for Synnax Lab membership as a Data Scientist. In the Synnax Lab section, click the **EARN WITH SYNNAX LAB** button.

![Signup Options](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_3.png)

### 3.1 Fill out the form and click Join

We give preference to applicants who provide their **real names and agree to participate in a short introductory video call**.

![Signup Form](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_3.1.png)

## Step 4: Get application approval

Applications are processed in the order in which they are received. We currently approve 3 applications per week, each after a short 15-minute introductory video call with the applicant. The typical wait time for interview scheduling is 1-2 weeks.

## Step 5: Create a New API Key

In the **API Keys** section, click on **NEW API KEY** to generate a new key.

![API Keys](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_5.png)


## Step 6: Name Your Key

Give your API key a name, tick each checkbox, and click **CREATE** to finalize the process.

![Name and Create API Key](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_6.png)


## Step 7: Copy Your API Key

Once your API key is created, **copy** it immediately. This key will not be available after closing the modal.

![Copy API Key](https://synnax-ai.github.io/synnax-lab-sdk/tutorials/img/step_7.png)

🙏 You have successfully created your account and obtained an API key for Synnax Lab.  

💪 Now, roll up your sleeves, get your hands dirty, and make some magic happen!

# Data Science Meets Finance: Let’s Get Coding!

## Installing SDK

To make your journey from fetching data to submitting predictions as smooth as possible, Synnax Lab has created an SDK that's as effortless as a Sunday morning. Let’s install it and get started!

In [11]:
!pip install synnax-lab-sdk

## Imports

In [None]:
# Importing necessary libraries
import time
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.multioutput import MultiOutputRegressor
from lightgbm import LGBMRegressor

# Importing Synnax Lab SDK Client
from synnax_lab_sdk.client import SynnaxLabClient

## Fetching Datasets

The script below creates a `synnax-data` folder in the current working directory, where all downloaded datasets are stored.
The `files` object is a dictionary that maps file names to their respective paths and also records the dataset date. You can download the training data with or without the optional macroeconomic datasets by setting `with_macro_data`.

In [None]:
synnax_lab_client = SynnaxLabClient(api_key = "your_api_key")

files = synnax_lab_client.get_datasets(with_macro_data=True)
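
Because the key is shown only once, it is worth keeping it out of the notebook itself. The sketch below reads it from an environment variable instead of hard-coding it. The variable name `SYNNAX_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not part of the SDK.

```python
import os


def load_api_key(env_var: str = "SYNNAX_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice; use whatever name you prefer.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Synnax Lab API key."
        )
    return api_key


# Usage with the client from the cell above
# (left commented out so the sketch runs without the SDK installed):
# synnax_lab_client = SynnaxLabClient(api_key=load_api_key())
```

With this in place, the key never appears in a saved or shared notebook.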

In [None]:
print(files)

{'x_train_path': 'synnax-data/datasets/X_train.csv',
 'targets_train_path': 'synnax-data/datasets/targets_train.csv',
 'x_forward_looking_path': 'synnax-data/datasets/X_forward_looking.csv',
 'macro_train_path': 'synnax-data/datasets/macro_train.csv',
 'macro_forward_looking_path': 'synnax-data/datasets/macro_forward_looking.csv',
 'sample_submission_path': 'synnax-data/datasets/sample_submission.csv',
 'data_dictionary_path': 'synnax-data/datasets/data_dictionary.txt',
 'dataset_date': '2024-09-10'}
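
Before loading anything, it can help to confirm that every path in the dictionary actually points to a file. The helper below is a minimal sketch. `missing_dataset_files` is a hypothetical name, and the example dictionary is made up to mirror the shape of the output above.

```python
from pathlib import Path


def missing_dataset_files(files: dict) -> list:
    """Return the keys ending in '_path' whose file does not exist on disk."""
    return [
        key
        for key, value in files.items()
        if key.endswith("_path") and not Path(value).is_file()
    ]


# Made-up dictionary shaped like the output above; the path does not exist.
example_files = {
    "x_train_path": "no-such-folder/X_train.csv",
    "dataset_date": "2024-09-10",
}
print(missing_dataset_files(example_files))
```

An empty list means every dataset was downloaded; non-path entries such as `dataset_date` are ignored.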

## Dataset Structure 📂

<pre>
📂 synnax-data
└── 📂 datasets
    ├── 📜 data_dictionary.txt
    ├── 📊 macro_forward_looking.csv
    ├── 📈 macro_train.csv
    ├── 📝 sample_submission.csv
    ├── 🎯 targets_train.csv
    ├── 🔮 X_forward_looking.csv
    └── 📚 X_train.csv
</pre>

### 📋 Description:
The datasets subdirectory includes everything you need:

- **`X_train.csv`**: Your training data with financial features.
- **`targets_train.csv`**: The targets you're predicting in training.
- **`X_forward_looking.csv`**: The test data where you’ll make your predictions.
- **`macro_train.csv`** & **`macro_forward_looking.csv`**: Macroeconomic data to enrich your model.
- **`sample_submission.csv`**: Shows you how to format your predictions for submission.



## Loading Dataset

In [None]:
X_train = pd.read_csv(files['x_train_path'])  # Training features
X_forward_looking = pd.read_csv(files['x_forward_looking_path'])  # Test features
targets_train = pd.read_csv(files['targets_train_path'])  # Training targets
# macro_train = pd.read_csv(files['macro_train_path'])  # Historical macroeconomic data
# macro_forward_looking = pd.read_csv(files['macro_forward_looking_path'])  # Future macroeconomic data

Let's take a quick look at the first few rows of our training data.

In [9]:
X_train.head()

Unnamed: 0,company_id,country_code,sector,industry,Q4_feature_1,Q4_feature_2,Q4_feature_3,Q4_feature_4,Q4_feature_5,Q4_feature_6,...,Q1_feature_23,Q1_feature_24,Q1_feature_25,Q1_feature_26,Q1_feature_27,Q1_feature_28,Q1_feature_29,Q1_feature_30,Q1_feature_31,Q1_feature_32
0,nYVQRtTxDB3EaeaD98BuKu,CN,Real Estate,Real Estate - Development,1765.480403,239.770275,240.021202,14.165341,809.387895,484.444817,...,10849.537197,0.513777,404.894598,453.264525,9055.69634,2202.184254,2674.29486,14393.208305,1243.012346,2550.561774
1,hH9tnYKK4W8WEabfHh9H8U,CN,Healthcare,Medical Devices,1731.733563,521.355546,522.461691,516.979559,713.358055,1663.43433,...,5754.838324,4640.349737,1496.118795,47.318466,5158.087021,2413.507752,26111.851856,34411.324614,275.411816,19557.256849
2,CCmxeGSwRDWmbNPuVzqya2,SE,Healthcare,Drug Manufacturers - Specialty & Generic,25.606556,-58.775449,-58.736227,-62.454359,49.037967,45.634476,...,94.077165,0.513777,55.167809,0.513777,43.602616,43.547076,840.763409,938.305588,5.859282,370.93264
3,2mwDetR8L6pkGSAXKXAw7o,CA,Basic Materials,Gold,5.353338,-4.332356,-4.325634,-12.495115,4.642288,0.513777,...,7.266907,0.513777,0.513777,0.513777,7.239456,22.830083,520.297297,525.640181,7.201703,327.951471
4,XESj5RGnAGZEXTsgacy55D,US,Technology,Computer Hardware,4288.896504,-1915.436767,-1336.213232,-1883.617589,0.513777,2941.012811,...,7381.71684,4407.750871,8772.202299,528.269268,3629.511427,6730.917023,37884.227751,48263.64617,794.825292,25006.612919


## Checking Categorical Columns

In [6]:
cat_cols = X_train.select_dtypes(include='object').columns.tolist()
print(cat_cols)

['company_id', 'country_code', 'sector', 'industry']


In [7]:
# Check the number of unique values in the categorical columns
X_train[cat_cols[1:]].nunique()

country_code     80
sector           11
industry        137
dtype: int64

# Process data

In [None]:
# Combine training and test data for consistent encoding
data = pd.concat([X_train, X_forward_looking], axis=0)

## Encode categorical variables

In [20]:
# Encode categorical columns using LabelEncoder
for col in cat_cols:
    data[col] = LabelEncoder().fit_transform(data[col])
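
Concatenating the two sets before encoding guarantees that every category in either set receives a code. An alternative, sketched below on made-up data, is to fit the encoder on the training rows only and map categories that appear only in the forward-looking set to a sentinel value. This swaps `LabelEncoder` for scikit-learn's `OrdinalEncoder`; the toy frames and the sentinel `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "Energy" never appears in the training rows
train = pd.DataFrame({"sector": ["Technology", "Healthcare", "Technology"]})
forward = pd.DataFrame({"sector": ["Healthcare", "Energy"]})

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train[["sector"]])
forward_encoded = encoder.transform(forward[["sector"]])

print(train_encoded.ravel())
print(forward_encoded.ravel())
```

Fitting on training rows only keeps the encoding independent of the data you predict on, at the cost of lumping all unseen categories together.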

There are multiple ways to encode categorical variables; choose one based on the model you will use and the nature of the data.

Tree-based models can work with almost any kind of encoding, whereas for linear models `LabelEncoder` may not be a viable option because it imposes an arbitrary order on the categories.

Try implementing:
- one-hot-encoding
- label-encoding
- frequency-encoding
- mean-target-encoding

to find the best option for your model.
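
As a starting point, three of the options listed above can be sketched with plain pandas. The toy frame and its `target` column are made up for illustration, and the mean-target encoding here is the naive version.

```python
import pandas as pd

toy = pd.DataFrame({
    "sector": ["Technology", "Healthcare", "Technology", "Energy"],
    "target": [1.0, 3.0, 2.0, 5.0],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["sector"], prefix="sector")

# Frequency encoding: share of rows that carry each category
frequency = toy["sector"].map(toy["sector"].value_counts(normalize=True))

# Mean-target encoding: average target value per category.
# Computed naively on all rows here; in practice compute it out-of-fold
# so that a row's own target does not leak into its feature.
mean_target = toy["sector"].map(toy.groupby("sector")["target"].mean())

print(one_hot.columns.tolist())
print(frequency.tolist())
print(mean_target.tolist())
```

With 137 distinct `industry` values, one-hot encoding adds many sparse columns, which is one reason frequency or mean-target encoding may be worth comparing.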

## Handling Missing Values

### Define Function

In [21]:
def fill_missing_with_mean(df):
    """
    Fill missing values with the mean of each column.
    """
    for col in df.columns:
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].mean())
    return df

Keep in mind that there are multiple ways to deal with NaNs. We have chosen the simplest one for this tutorial.

### Apply Function to Datasets

In [22]:
data = fill_missing_with_mean(data)


In [23]:
# Drop columns where every value is missing (this can occasionally happen)
all_missing_cols = [col for col in data.columns if data[col].isnull().all()]
print(all_missing_cols)
data = data.drop(columns=all_missing_cols)

Missing value imputation is a tricky process. While replacing NaNs with the column mean technically makes your dataset usable for modeling, it may populate your *ground truth* training data with many faulty values. In many cases the column mean will not be the right choice.

Consider:
- median
- mode
- KNN imputer
- verstack.NaNImputer
- etc.
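
Two of these alternatives, median and KNN imputation, are available in scikit-learn. The sketch below uses made-up frames and, unlike the pipeline above, learns the fill values from the training rows only. That is a design choice for illustration rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with one extreme value in column "a"
train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 100.0], "b": [1.0, 2.0, 3.0, 4.0]})
forward = pd.DataFrame({"a": [np.nan, 5.0], "b": [2.0, np.nan]})

# Median imputation: less sensitive to extreme values than the mean
median_imputer = SimpleImputer(strategy="median")
train_median = pd.DataFrame(median_imputer.fit_transform(train), columns=train.columns)
forward_median = pd.DataFrame(median_imputer.transform(forward), columns=forward.columns)

# KNN imputation: fill each gap from the most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
train_knn = pd.DataFrame(knn_imputer.fit_transform(train), columns=train.columns)

print(forward_median)
```

In this toy example the mean of column `a` would be pulled toward the outlier `100.0`, while the median stays at `2.0`.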

In [160]:
# Split data back into training and test sets
X_train = data[:X_train.shape[0]]
X_forward_looking = data[X_train.shape[0]:]

## Dropping `company_id`

In [None]:
X_train = X_train.drop('company_id', axis=1)
X_forward_looking = X_forward_looking.drop('company_id', axis=1)
targets_train = targets_train.drop('company_id', axis=1)
# macro_train = macro_train.drop('company_id', axis=1)
# macro_forward_looking = macro_forward_looking.drop('company_id', axis=1)

## Training

### Example Hyperparameters

In [162]:
# define starter parameters for LGBMRegressor
params = {
    'learning_rate': 0.01,
    'num_leaves': 250,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.9,
    'verbosity': -1,
    'random_state': 42,
    'device_type': 'cpu',
    'objective': 'regression',
    'metric': 'l2',
    'num_threads': 10,
    'lambda_l1': 0.5,
    'n_estimators': 100
    }

### Train the Model

We will use the same type of model with fixed parameters to predict each of the 17 targets. Rather than writing code to train 17 independent models, we can let `sklearn.multioutput.MultiOutputRegressor` do it for us in a convenient sklearn-style one-liner.

But remember: each target, even though it represents a financial indicator from the same company and time period, may have different dependencies and may even require a different type of model. So try different strategies to improve your score.

In [163]:
# Initialize the model with parameters
regressor = MultiOutputRegressor(LGBMRegressor(**params))
# from sklearn.linear_model import Ridge
# regressor = MultiOutputRegressor(Ridge())

# Fit the model on training data
regressor.fit(X_train, targets_train)
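
If you want a different model for each target instead of one shared configuration, one possible approach is a plain dictionary of estimators. The sketch below runs on synthetic data. The target names and the pairing of `Ridge` and `RandomForestRegressor` with particular targets are arbitrary assumptions, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# Synthetic data: one roughly linear target and one non-linear target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.DataFrame({
    "target_a": X_toy["f1"] * 2.0 + rng.normal(scale=0.1, size=50),
    "target_b": np.abs(X_toy["f2"]) + rng.normal(scale=0.1, size=50),
})

# Hypothetical choice: one estimator per target column
models = {
    "target_a": Ridge(alpha=1.0),
    "target_b": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on its own target column
for target, model in models.items():
    model.fit(X_toy, y_toy[target])

# Collect the per-target predictions into one frame, one column per target
toy_predictions = pd.DataFrame(
    {target: model.predict(X_toy) for target, model in models.items()},
    index=X_toy.index,
)
print(toy_predictions.head())
```

The resulting frame has the same column layout as the targets, so it can replace the array returned by a single `MultiOutputRegressor`.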

### Make Predictions

In [164]:
predictions = regressor.predict(X_forward_looking)
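
Before spending a submission, you may want a local estimate of model quality. The sketch below holds out 20% of a synthetic dataset and reports R² per target. The scoring metric used by the platform is not described in this tutorial, so R² is only a stand-in, and `Ridge` is used to keep the example light.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data: 5 features, 3 targets with a linear relationship plus noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
true_weights = rng.normal(size=(5, 3))
y_toy = X_toy @ true_weights + rng.normal(scale=0.1, size=(200, 3))

# Hold out 20% of the rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

model = MultiOutputRegressor(Ridge())
model.fit(X_fit, y_fit)

# One R² value per target column
per_target_r2 = r2_score(y_val, model.predict(X_val), multioutput="raw_values")
print(per_target_r2)
```

Looking at the score per target, rather than one average, shows which targets might benefit from a different model.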

## Submitting Predictions

### Load the Sample Submission File

In [165]:
sample_submission = pd.read_csv(files['sample_submission_path'])

### Update the Submission File with Predictions

In [168]:
for col in sample_submission.columns[1:]:
    sample_submission[col] = sample_submission[col].astype(float)

sample_submission.iloc[:, 1:] = predictions
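
A shape mismatch or stray NaN at this step is easy to miss. The helper below is a minimal sketch of the same update with two sanity checks added. `fill_submission` is a hypothetical name, and like the cell above it assumes the first column of the template is the identifier.

```python
import numpy as np
import pandas as pd


def fill_submission(template: pd.DataFrame, predictions: np.ndarray) -> pd.DataFrame:
    """Return a copy of the template with every column after the first
    replaced by the predictions."""
    expected_shape = (template.shape[0], template.shape[1] - 1)
    if predictions.shape != expected_shape:
        raise ValueError(
            f"Expected predictions of shape {expected_shape}, got {predictions.shape}"
        )
    if not np.isfinite(predictions).all():
        raise ValueError("Predictions contain NaN or infinite values")
    filled = template.copy()
    filled[list(filled.columns[1:])] = predictions
    return filled


# Toy template with an identifier column and two target columns
toy_template = pd.DataFrame(
    {"company_id": ["a", "b"], "target_1": [0, 0], "target_2": [0, 0]}
)
toy_filled = fill_submission(toy_template, np.array([[1.5, 2.5], [3.5, 4.5]]))
print(toy_filled)
```

The template is left untouched, so the helper can be called again with a new set of predictions.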

### Save Submission File

In [169]:
# Save the updated submission file
submission_path = 'submission.csv'
sample_submission.to_csv(submission_path, index=False)

### Submit the Predictions

In [None]:
synnax_lab_client.submit_predictions(files["dataset_date"], submission_path)
time.sleep(5) # 5 seconds to process submission before running next cell

## Check confidence score (validation)

Submissions take some time to be processed in the backend. If the function below does not return a `confidenceScore` right away, give it a few seconds and rerun it. The `confidenceScore` is available once the submission has `status: 'Processed'`.

In [147]:
synnax_lab_client.get_past_submissions()

[{'id': '4a809ed2-a34b-409a-895b-85d913817a9d',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'Processed',
  'confidenceScore': -1030.0046503478281,
  'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',
  'uploadedAt': '2024-09-10T11:17:30.540Z'},
 {'id': 'ba10218d-4293-49de-9c56-36d7eb0c0657',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'Processed',
  'confidenceScore': -1030.0046503478281,
  'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',
  'uploadedAt': '2024-09-10T11:16:48.016Z'},
 {'id': '4140f050-eb0e-4e25-bf99-d787a26fa367',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'Processed',
  'confidenceScore': 0.12233956654794893,
  'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',
  'uploadedAt': '2024-09-10T10:47:02.448Z'},
 {'id': 'dbfa0841-31bc-4580-bb4c-c647d1e9f628',
  'datasetDate': '2024-09-10',
  'originalFilename': 'submission.csv',
  'status': 'P
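
Instead of rerunning the cell by hand, you could poll until the newest submission is processed. The helper below is a sketch, and `wait_for_score` is a hypothetical name. It relies only on the `status` and `uploadedAt` fields visible in the output above, and takes the fetch function as an argument so it does not depend on the SDK directly.

```python
import time


def wait_for_score(fetch_submissions, attempts=10, delay_seconds=5.0):
    """Poll until the most recently uploaded submission is processed, then return it.

    fetch_submissions is any zero-argument callable returning a list of dicts
    shaped like the output above, e.g. synnax_lab_client.get_past_submissions.
    """
    for attempt in range(attempts):
        submissions = fetch_submissions()
        if submissions:
            # Timestamps in this fixed ISO 8601 format sort correctly as strings
            latest = max(submissions, key=lambda s: s["uploadedAt"])
            if latest.get("status") == "Processed":
                return latest
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise TimeoutError("The latest submission was not processed in time")


# Usage (commented out so the sketch runs without the SDK or network access):
# latest = wait_for_score(synnax_lab_client.get_past_submissions)
# print(latest["confidenceScore"])
```

The default of 10 attempts at 5-second intervals is an arbitrary choice; adjust it to the processing times you observe.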

## Congratulations! 🥳 You’ve Made Your First Submission!

This was a basic pipeline. Now, it's time to level up! 🚀 Use the macroeconomic data to make your model more robust. Experiment with different models, tweak them, and maybe even try some neural networks. 🧠💥

Data science is like cooking. There are endless recipes to try. So, spice things up, preprocess like a pro, and get those scores soaring! 🌟👨‍🍳👩‍🍳

Good luck, and may the data be ever in your favor! 🍀📈

# P.S.

The pipeline above is a simple example of how to get started with synnax-lab-sdk and arrive at your first submission.

To improve your scores, look into:
1. Macroeconomic data
2. Other encoding options for categorical variables (try an individual mean-target encoding for each target)
3. Outlier handling
4. More advanced missing value imputation options
5. Different models, individual models for each target, hyperparameter tuning
6. Model ensembling (if using different models, make sure you have appropriate processing for each model)
7. Feature selection: not all features in X_train and macro_train will be useful. Try removing some of them.
8. And, of course, your creativity. We're confident that your individual approach can beat anything we have laid out in this short tutorial.
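
As one example from the list, outliers can be tamed by clipping each numeric column to quantile bounds learned from the training data. This is a minimal sketch: `clip_outliers` is a hypothetical helper, and the 1% and 99% defaults are arbitrary starting values.

```python
import pandas as pd


def clip_outliers(train: pd.DataFrame, other: pd.DataFrame, lower=0.01, upper=0.99):
    """Clip numeric columns of both frames to quantile bounds learned from `train`."""
    numeric_cols = train.select_dtypes(include="number").columns
    low = train[numeric_cols].quantile(lower)
    high = train[numeric_cols].quantile(upper)
    train_clipped = train.copy()
    other_clipped = other.copy()
    # axis=1 aligns the per-column bounds with the frame's columns
    train_clipped[numeric_cols] = train[numeric_cols].clip(lower=low, upper=high, axis=1)
    other_clipped[numeric_cols] = other[numeric_cols].clip(lower=low, upper=high, axis=1)
    return train_clipped, other_clipped


# Toy data with extreme values in both frames
toy_train = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 1000.0]})
toy_other = pd.DataFrame({"x": [-500.0, 2.0, 5000.0]})
clipped_train, clipped_other = clip_outliers(toy_train, toy_other, lower=0.25, upper=0.75)
print(clipped_other["x"].tolist())
```

Tree-based models are fairly robust to extreme values, so this tends to matter more when you experiment with linear models or neural networks.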