# Automatically generate prediction problems for the Covid dataset with Trane

In this tutorial, we show how to use Trane to generate prediction problems for the Covid dataset.

## Load Data
First, let's load our data, and examine the first few rows.

In [1]:
import trane

data = trane.datasets.load_covid()
data.head(5)

Unnamed: 0,Country/Region,Date,Province/State,Lat,Long,Confirmed,Deaths,Recovered
0,Afghanistan,2020-01-22,0,33.0,65.0,0,0,0
1,Monaco,2020-01-22,0,43.7333,7.4167,0,0,0
2,Mongolia,2020-01-22,0,46.8625,103.8467,0,0,0
3,Montenegro,2020-01-22,0,42.5,19.3,0,0,0
4,Morocco,2020-01-22,0,31.7917,-7.0926,0,0,0


In [2]:
print(f"Number of Rows: {data.shape[0]}")

Number of Rows: 17136


As we can see, this is a dataset of Covid cases. Each row reports the confirmed cases, deaths, and recoveries for a country or region on a given date.

We need to set the following parameters to create the `CutoffStrategy`:

**entity_col**: the column name to use for grouping the data.
- For this walkthrough, we are interested in prediction problems for each `Country/Region`.

**window_size**: the amount of data to use per label
- We will set this at `2d`, to account for the delay in reporting Covid information. 

**minimum_data**: the time at which the labeling should begin
- We want to use all available information for labeling, so we set `minimum_data` to the timestamp of the oldest data point.

**maximum_data**: the time at which the labeling will end
- We want to create labels for all data points, so we set `maximum_data` to the timestamp of the most recent data point.
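
To make the role of these parameters concrete, here is a minimal sketch of how a 2-day window could be stepped between the two bounds. It uses only the standard library and is an illustration, not Trane's implementation: the helper `window_starts` is hypothetical, and treating the upper bound as inclusive is an assumption.

```python
from datetime import date, timedelta


def window_starts(minimum_data, maximum_data, window_days):
    """Return the start date of each window between the two bounds.

    Hypothetical helper for illustration only; the upper bound is
    treated as inclusive, which is an assumption.
    """
    start = date.fromisoformat(minimum_data)
    end = date.fromisoformat(maximum_data)
    step = timedelta(days=window_days)
    starts = []
    while start <= end:
        starts.append(start)
        start += step
    return starts


starts = window_starts("2020-01-22", "2020-03-29", 2)
print(starts[:3])   # the first three window starts
print(len(starts))  # how many windows fit in the date range
```

Each entity would then receive one label per window start, which is why a smaller `window_size` produces more labels per entity.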


In [3]:
entity_col = "Country/Region"
window_size = "2d"
minimum_data = "2020-01-22"
maximum_data = "2020-03-29"
cutoff_strategy = trane.CutoffStrategy(
    entity_col=entity_col,
    window_size=window_size,
    minimum_data=minimum_data,
    maximum_data=maximum_data,
)

We now have a cutoff_strategy we can use to generate prediction problems.

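Next, we need to create a `PredictionProblemGenerator` and use it to generate the prediction problems. Note that `table_meta` must already be defined before running the next cell; its definition is not shown in this walkthrough.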


In [6]:
time_col = "Date"

problem_generator = trane.PredictionProblemGenerator(
    entity_col=entity_col,
    time_col=time_col,
    cutoff_strategy=cutoff_strategy,
    table_meta=table_meta,
)
problems = problem_generator.generate(data, generate_thresholds=True)

  0%|          | 0/1044 [00:00<?, ?it/s]

Success/Attempt = 514/1044


In [7]:
prediction_problem_to_label_times = {}
for problem in problems:
    # Key each set of label times by the problem's sentence description.
    prediction_problem_to_label_times[str(problem)] = problem.execute(
        data, -1, verbose=False
    )

In [9]:
len(problems)

514

In [10]:
picked_indexes = [1, 50, 200, 300, 400]
for i in picked_indexes:
    # Printing a problem shows its sentence description.
    print(problems[i])
    print("----")

print(f"\nTotal Number of Prediction Problems = {len(problems)}")

For each <Country/Region> predict the number of records with <Lat> greater than 41.2956 in next 2d days
----
For each <Country/Region> predict the total <Confirmed> in all related records with <Long> greater than 84.25 in next 2d days
----
For each <Country/Region> predict the average <Confirmed> in all related records with <Recovered> greater than 0 in next 2d days
----
For each <Country/Region> predict the maximum <Deaths> in all related records with <Long> greater than -23.0418 in next 2d days
----
For each <Country/Region> predict the minimum <Long> in all related records with <Lat> greater than 23.7 in next 2d days
----

Total Number of Prediction Problems = 514
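
The sentences above all follow the same pattern: filter the records in the window, then aggregate a column. The sketch below shows that composition in plain Python. It is an illustration only, not Trane's implementation: `make_problem` is a hypothetical helper, the three records are made-up values, and returning `None` when no record passes the filter is an arbitrary choice.

```python
def make_problem(filter_col, threshold, agg, target_col):
    """Build a labeling function: filter on one column, aggregate another.

    Hypothetical helper for illustration only.
    """

    def label(records):
        # Keep only the records that pass the "greater than" filter.
        kept = [r for r in records if r[filter_col] > threshold]
        values = [r[target_col] for r in kept]
        # Arbitrary choice: no label when nothing passes the filter.
        return agg(values) if values else None

    return label


# Made-up records standing in for one entity's rows inside one window.
records = [
    {"Long": 65.0, "Confirmed": 10},
    {"Long": 103.8, "Confirmed": 4},
    {"Long": 90.0, "Confirmed": 6},
]

# "predict the total <Confirmed> in all related records with <Long> greater than 84.25"
total_confirmed = make_problem("Long", 84.25, sum, "Confirmed")
print(total_confirmed(records))  # 10 (4 + 6; the first record is filtered out)
```

Swapping `sum` for `max`, `min`, or an averaging function gives the other aggregation variants seen in the generated sentences.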


In [11]:
problem = problems[0]
problem_sentence = str(problem)
label_times = problem.execute(data, -1, verbose=False)
print(problem_sentence, "\n")
print(label_times.head(5))

For each <Country/Region> predict the number of records in next 2d days 

  Country/Region       time  _execute_operations_on_df
0    Afghanistan 2020-01-22                          2
1    Afghanistan 2020-01-24                          2
2    Afghanistan 2020-01-26                          2
3    Afghanistan 2020-01-28                          2
4    Afghanistan 2020-01-30                          2
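
The label of 2 in every row is what we would expect if the entity has one record per day and each window spans two days. The following sketch reproduces that count with the standard library. It is an illustration, not Trane's implementation: `count_records_per_window` is a hypothetical helper, the daily records are synthetic, and the half-open interval `[start, start + window)` is an assumption.

```python
from datetime import date, timedelta


def count_records_per_window(record_dates, first_start, window_days, n_windows):
    """Count how many records fall in each [start, start + window) interval.

    Hypothetical helper for illustration only.
    """
    window = timedelta(days=window_days)
    labels = []
    for i in range(n_windows):
        start = first_start + i * window
        end = start + window
        count = sum(1 for d in record_dates if start <= d < end)
        labels.append((start, count))
    return labels


# Synthetic entity with exactly one record per day for six days.
daily_records = [date(2020, 1, 22) + timedelta(days=i) for i in range(6)]

labels = count_records_per_window(daily_records, date(2020, 1, 22), 2, 3)
for start, count in labels:
    print(start, count)
```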


In [12]:
ft_wrapper = trane.FeaturetoolsWrapper(
    df=data, entity_col=entity_col, time_col=time_col, name="covid"
)
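# NOTE: `cutoff` is never defined in this notebook, so this call raises the
# NameError shown below. Define it before running this cell.
feature_matrix, features = ft_wrapper.compute_features(label_times, cutoff)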
for feature in features[:5]:
    print(feature)

NameError: name 'cutoff' is not defined

In [None]:
feature_matrix.head(5)

In [None]:
feature_matrix_encoded, features_encoded = ft_wrapper.encode_features(
    label_times, cutoff
)

In [None]:
label_times.head(5)

In [None]:
# The dictionary is keyed by problem sentence, not by integer position.
print(prediction_problem_to_label_times[str(problems[0])])

In [None]:
# Print each problem sentence with the first few rows of its label times.
for problem_str in prediction_problem_to_label_times:
    print(problem_str)
    label_times = prediction_problem_to_label_times[problem_str]
    print(label_times.head(3))

In [None]:
from trane.utils import multiprocess_prediction_problem

# Use `data`, the DataFrame loaded at the top of the notebook (`df` was never defined).
prediction_problem_to_label_times = multiprocess_prediction_problem(problems, data)

In [None]:
for problem_str in prediction_problem_to_label_times:
    print(problem_str)
    label_times = prediction_problem_to_label_times[problem_str]
    print(label_times.head(3))