# Main Problem (Required)

Machine learning is an iterative process. We need to periodically run regression tests to
understand how our predictions are doing. Processing that regression data is key to
improving future predictions. The task is to process the data in the CSV to determine
an accuracy for one specific metric.

The provided dataset contains over 400,000 pitches from the 2021 MLB season. It
includes some extra information, but the most pertinent parts are the reach and
outcome columns. The dataset can be found here. Note: The reach value has been
scrambled to be different from our actual predictions.

You must process the data in Python to generate “accuracy” data for what is considered
a “reach” in baseball.

● The outcome column contains the result of that at bat (uniquely identified by the
combination of game_id and event). Any single (X1), double (X2), triple (X3),
home run (HR), or walk (BB/HBP) would be considered a “reach”.

● If the outcome was one of those, then “reach” was correct/true.

● The reach column contains the percentage chance the machine learning
algorithms predicted the outcome would result in a “reach”.

● If you subtract that probability from 100, it should result in the chance the batter
“doesn’t reach” or “not”.

One simplistic way of gauging correctness (hence the use of “correct”) would be
applying these rules:
a) If “reach” is greater than “not”, and “reach” was the outcome, then our
prediction was “correct” OR
b) If “not” is greater than “reach”, and “reach” was NOT the outcome, then
our prediction was “correct”

In order to generate the accuracy you would need to:
1. Determine if each item was a “reach” or “not”.
2. Determine if our prediction value was “correct”.
3. Determine how many predictions were “correct” out of the total to provide an
“accuracy” of how often the models were correct in predicting “reach”.

## Additional Problems (Optional)

Very few tasks exist within a vacuum. While providing an accuracy number is
wonderful, it does not capture how this data would be used in a real environment.
The following are a few ways this problem is extended in real life.
1. How would you turn this into a microservice or automate the running of this?
2. Describe how you would parallelize this for more than one outcome.
3. How would you communicate the results automatically?
4. Add statistical information around the results
5. Add visualization of your results

Answering these questions in documentation format is acceptable, although code is better if you are able.

## Deliverables

You will not be committing this to a repository, but package your solution up like you
would, including a brief README that describes:
1. Your high level approach to the problem.
2. How to run your solution
3. Any design decisions, trade offs, or assumptions you made.
4. Any extensions you have added or would like to add if you had more time.
5. What are the bottlenecks in the process? What ways could make this run faster?
6. Any and all code, tests, or additional benchmarks you write.

In addition to the README and code, please include a brief analysis of the data (what
did you find interesting about the data) and any answers to additional problems.

# Main Problem Solution

In [33]:
import pandas as pd
import numpy as np

In [17]:
df = pd.read_csv(r'sample_data.csv')
print('Dimension of dataframe:', df.shape)

Dimension of dataframe: (414195, 8)


### 1. Determine if each item was a “reach” or “not”.

In [28]:
# Add "is_reach" column (boolean) to dataframe

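# Outcomes that count as a "reach": single, double, triple, home run, walk, hit by pitch
reaches = ['X1', 'X2', 'X3', 'HR', 'BB', 'HBP']
# Vectorized membership test; True where the outcome is one of the reach codes
df['is_reach'] = df['outcome'].isin(reaches)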
df.head()

| | game_id | event | subevent | outs | balls | strikes | reach | outcome | is_reach |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 25.77 | KS | False |
| 1 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 23.91 | KS | False |
| 2 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 3.0 | 0.0 | 0.0 | 2.0 | 13.06 | KS | False |
| 3 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 4.0 | 0.0 | 0.0 | 2.0 | 13.12 | KS | False |
| 4 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 28.77 | KS | False |


### 2. Determine if our prediction value was “correct”.

In [56]:
# Add "prediction_correct" column (boolean) to dataframe

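# The model predicts "reach" when reach > 50 (that is, "reach" > "not"), otherwise "not".
# Note: a reach value of exactly 50 is treated as a "not" prediction here.
df['prediction_correct'] = (df['reach'] > 50) == df['is_reach']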
df.head()

| | game_id | event | subevent | outs | balls | strikes | reach | outcome | is_reach | prediction_correct |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 25.77 | KS | False | True |
| 1 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 23.91 | KS | False | True |
| 2 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 3.0 | 0.0 | 0.0 | 2.0 | 13.06 | KS | False | True |
| 3 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 1.0 | 4.0 | 0.0 | 0.0 | 2.0 | 13.12 | KS | False | True |
| 4 | 503740b8-b4dc-4e91-8f1a-6971b6fdea5c | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 28.77 | KS | False | True |
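
The comparison against 50 treats a reach value of exactly 50 as a "not" prediction. Read literally, rules (a) and (b) both require a strict inequality, so a 50/50 prediction would count as incorrect whatever the outcome. The following is a minimal sketch of that stricter reading. The `toy` frame and the `prediction_correct_strict` column are hypothetical names, and the rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from sample_data.csv.
toy = pd.DataFrame({
    "reach": [25.77, 72.10, 50.00, 50.00, 61.30],
    "is_reach": [False, True, False, True, False],
})

# Chance that the batter does not reach, as defined in the problem statement.
not_reach = 100 - toy["reach"]

# Rule (a): "reach" is strictly greater than "not" and the batter reached.
rule_a = (toy["reach"] > not_reach) & toy["is_reach"]
# Rule (b): "not" is strictly greater than "reach" and the batter did not reach.
rule_b = (not_reach > toy["reach"]) & ~toy["is_reach"]

# A prediction is correct only if one of the two rules applies.
toy["prediction_correct_strict"] = rule_a | rule_b
print(toy)
```

The two versions disagree only on rows where reach is exactly 50 (or missing) and the batter did not reach, so on a dataset with few such rows the accuracy should barely move.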


### 3. Determine how many predictions were “correct” out of the total to provide an “accuracy” of how often the models were correct in predicting “reach”.

In [60]:
# Calculate the accuracy of the model correctly predicting reach

total = len(df)                               # number of predictions (rows)
correct = int(df['prediction_correct'].sum()) # True counts as 1
print(f"Out of {total:,} predictions, {correct} predictions were correct. An accuracy of "
      f"{100 * correct / total}.")

Out of 414,195 predictions, 282265 predictions were correct. An accuracy of 68.14785306437788.
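
A single accuracy figure can hide how the errors are split between the two classes. As one possible take on item 4 of the optional problems (statistical information), the sketch below builds a confusion table and compares the accuracy with a naive baseline that always predicts "not". It runs on a small made-up frame; `toy`, `predicted_reach`, `confusion`, and `baseline` are illustrative names, and no figures here describe the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from sample_data.csv.
toy = pd.DataFrame({
    "reach": [25.77, 72.10, 13.06, 55.40, 28.77, 64.00],
    "is_reach": [False, True, False, False, True, True],
})

# Same thresholding as the main solution: predict "reach" when reach > 50.
toy["predicted_reach"] = toy["reach"] > 50

# Rows are the actual outcome, columns are the predicted outcome.
confusion = pd.crosstab(toy["is_reach"], toy["predicted_reach"],
                        rownames=["actual"], colnames=["predicted"])

# Fraction of rows where prediction and outcome agree.
accuracy = (toy["predicted_reach"] == toy["is_reach"]).mean()
# Accuracy of always predicting "not": the share of rows that are not a reach.
baseline = 1 - toy["is_reach"].mean()

print(confusion)
print(f"accuracy={accuracy:.3f}, always-not baseline={baseline:.3f}")
```

If the model's accuracy on the real data is close to the always-"not" baseline, the headline number says little about how well "reach" itself is predicted.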


# Additional Problems (Optional) Solution

### 1. How would you turn this into a microservice or automate the running of this?

This response assumes that the goal of the automation is to recalculate the accuracy of the prediction model during a game, starting from the sample data provided, and that the updated accuracy value is displayed live during the game.

One way to automate the accuracy calculation is to implement a cron job, with the accuracy recalculated after each pitch of the game. The architecture would include a client, a server, a cache, a logs table (to keep track of each cron job run), and a data table (to store pitch data such as the given sample data set). For this explanation, only one client is considered: a local TV network showing only the games of one local team.

The client would send a request containing a cron job to the entry point server (the server), requesting an accuracy value based on data collected from previous games. This cron job would run once a day during the baseball season and would also request that the accuracy value be sent to a cache. During the game, the client would store data from each pitch in the cache and recalculate the accuracy value to include the new data. If the cache holds the running counts of correct and total predictions, producing the next accuracy value takes O(1) time: add the latest result to the counts and divide. By contrast, sending each data entry to the data table and recalculating the accuracy over the entire table takes about O(n) time. Another cron job sent by the client would move all the new data stored in the cache to the data table at the end of each day during the baseball season.
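
The O(1) update works when the cache keeps the counts of correct and total predictions rather than only the last accuracy value. A minimal sketch of that idea follows. The `RunningAccuracy` class and its fields are hypothetical stand-ins for whatever the real cache would store, and the seed counts are the totals from the main problem.

```python
from dataclasses import dataclass


@dataclass
class RunningAccuracy:
    """Running totals that would live in the cache (hypothetical structure)."""
    correct: int = 0
    total: int = 0

    def update(self, reach_pct: float, is_reach: bool) -> float:
        """Fold one pitch into the totals and return the new accuracy in percent."""
        predicted_reach = reach_pct > 50          # same threshold as the main solution
        self.correct += int(predicted_reach == is_reach)
        self.total += 1
        return 100 * self.correct / self.total    # constant-time: two adds and a divide


# Seed with the totals from the main problem, then add one new pitch.
cache = RunningAccuracy(correct=282265, total=414195)
new_accuracy = cache.update(reach_pct=25.77, is_reach=False)
print(f"{new_accuracy:.2f}")
```

Each call touches only two integers, so the cost per pitch does not grow with the number of pitches already seen.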

Each cron job run would be recorded in the logs table to support debugging and recovery from system failures. Each row in the logs table accounts for one run of a job, so a job that runs several times produces several rows. Each row would include a timestamp, the duration of the run, the return values, and any logs from the script. Since there is assumed to be only one client and the number of jobs is small, a queue/scheduler can be used but is not necessary.
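
As a rough illustration of one row per run, the sketch below wraps a job in a function that records its start time, duration, return value, and any error text. The in-memory `logs_table` list, the `run_logged` helper, and the field names are all assumptions made for this example; a real system would write to a database table.

```python
import time
import traceback
from datetime import datetime, timezone

logs_table = []  # in-memory stand-in for the real logs table (assumption)


def run_logged(job_name, job_func):
    """Run job_func once and append one row describing the run to logs_table."""
    start = time.perf_counter()
    row = {
        "job": job_name,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "status": "success",
        "return_value": None,
        "log": "",
    }
    try:
        row["return_value"] = job_func()
    except Exception:
        # Keep the traceback so a failed run can be debugged later.
        row["status"] = "failed"
        row["log"] = traceback.format_exc()
    row["duration_seconds"] = time.perf_counter() - start
    logs_table.append(row)
    return row


# Two runs of the same job produce two separate rows.
run_logged("daily_accuracy", lambda: 68.15)
run_logged("daily_accuracy", lambda: 1 / 0)
```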


### 2. Describe how you would parallelize this for more than one outcome.

If there is more than one client, with a client being the broadcast of one MLB team's games, then a jobs table, a scheduler, and a worker pool would also need to be implemented. Cron jobs sent by the client servers would go to an entry point server, which then sends that information (along with any associated data such as the client ID) to the jobs table. The jobs table stores cron jobs that need to be run. The scheduler polls the jobs table at a set interval (such as every hour) and identifies jobs that need to be run. Those jobs are sent to a queue, which the worker pool reads first in, first out. Each worker in the worker pool runs a script as designated by the cron job.

If the worker runs a script with no issues, a return value is sent indicating a successful run. If a job fails for the first time, the scheduler sends the job back to the queue. If the job fails multiple times, the scheduler checks how many times the worker has failed. If the worker has failed many times, the worker is killed and a new worker is created. If the job has failed more than a set number of times, an error message is sent to the client.

The scheduler can poll the logs table to see whether a job has been running for more than 24 hours. If so, the job is killed. The scheduler would also monitor the size of the queue to decide whether workers should be created or killed.
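
The queue, worker pool, and retry behavior can be sketched with Python's standard `queue` and `threading` modules. This is a simplified illustration under stated assumptions: the retry limit `MAX_ATTEMPTS`, the job names, and the `run_pool` helper are hypothetical, workers requeue failed jobs themselves instead of going through a scheduler, and the jobs table, interval polling, worker replacement, and 24-hour timeout are left out.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # assumed limit before an error is reported to the client


def worker(jobs, results, lock):
    """Pull jobs from the FIFO queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            return
        name, func, attempts = job
        try:
            value = func()
            with lock:
                results[name] = ("success", value)
        except Exception as exc:
            if attempts + 1 < MAX_ATTEMPTS:
                jobs.put((name, func, attempts + 1))  # send back to the queue
            else:
                with lock:
                    results[name] = ("error", str(exc))  # would be sent to the client
        finally:
            jobs.task_done()


def run_pool(job_list, num_workers=2):
    """Queue every job, run them on a small worker pool, and collect results."""
    jobs, results, lock = queue.Queue(), {}, threading.Lock()
    for name, func in job_list:
        jobs.put((name, func, 0))
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    jobs.join()              # wait until every job, including retries, is done
    for _ in threads:
        jobs.put(None)       # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return results


calls = {"flaky": 0}


def flaky():
    """Fails on the first call, succeeds afterwards."""
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("temporary failure")
    return "ok"


def always_fails():
    raise ValueError("bad input")


results = run_pool([
    ("team_a_accuracy", lambda: 68.15),
    ("team_b_accuracy", flaky),
    ("team_c_accuracy", always_fails),
])
print(results)
```

Here the flaky job succeeds on its second attempt, while the job that always fails is reported as an error after three attempts.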


### 3. How would you communicate the results automatically?

The results would be shown on the TV screen immediately after each pitch. The client server would send each pitch's data to the cache as it arrives during the live game and retrieve the newly calculated accuracy after each pitch.