# Analysis of Premise Selection Data

This notebook expects the data to be saved locally in the root directory of this project. To download the data, run the following commands from that directory:

```bash
wget math.iisc.ac.in/~gadgil/data/codet5_small_test.zip
unzip codet5_small_test.zip
```

We would like to analyse the data to answer the following questions:

- What is the distribution of the number of premises in the training data?
- What is the distribution of the predicted number of premises in the test data?
- What is the distribution of coverage and efficiency in the test data? It would be nice to see histograms and scatter plots.
- What is the nature of the missed identifiers? In particular, are they rare (i.e., not present in many theorems)?
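Coverage and efficiency are not defined at this point in the notebook. The sketch below assumes one plausible reading: coverage is the fraction of true identifiers that appear among the predictions, and efficiency is the fraction of predictions that are true identifiers. The function name `coverage_and_efficiency`, the sample identifiers, and the convention of returning 0.0 for an empty set are all assumptions made for illustration, not taken from the data.

```python
def coverage_and_efficiency(true_ids, predicted_ids):
    """Return (coverage, efficiency) under the assumed definitions.

    coverage   = |true ∩ predicted| / |true|       (assumed definition)
    efficiency = |true ∩ predicted| / |predicted|  (assumed definition)
    """
    true_set = set(true_ids)
    pred_set = set(predicted_ids)
    hits = true_set & pred_set
    # Arbitrary convention for this sketch: an empty denominator gives 0.0.
    coverage = len(hits) / len(true_set) if true_set else 0.0
    efficiency = len(hits) / len(pred_set) if pred_set else 0.0
    return coverage, efficiency


# Hypothetical example: two of the four true identifiers are predicted,
# and two of the three predictions are correct.
cov, eff = coverage_and_efficiency(
    ["a", "b", "c", "d"],
    ["a", "b", "x"],
)
```

Computing both values per data point would give the inputs for the histograms and the coverage-versus-efficiency scatter plot mentioned above.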

## Imports

In [None]:
import numpy as np
import plotly.graph_objects as go
import jsonlines

## Loading the data

In [None]:
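# Read every record of the JSON Lines file into a list, closing the file afterwards.
with jsonlines.open("./../rawdata/premises/identifiers/test_data.jsonl", 'r') as reader:
    data = list(reader)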

## Analysis

### Number of premises

In [None]:
# The number of "identifiers" associated with each data point.
# A list (rather than a lazy `map` object) can be passed to Plotly and reused.
premises_count = [len(d['identifiers']) for d in data]

In [None]:
fig = go.Figure(data=[go.Histogram(x=premises_count)])

fig.update_layout(
    title_text='Number of premises per data point',
    xaxis=dict(title='Number of premises'),
    yaxis=dict(title='Count'),
)

fig.show()
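A histogram is easier to read alongside a few summary statistics. The following is a minimal sketch using NumPy; the helper name `summarize_counts` and the sample counts are made up for illustration, and in the notebook the list of per-data-point premise counts would be passed instead.

```python
import numpy as np


def summarize_counts(counts):
    """Return basic summary statistics for a sequence of counts."""
    arr = np.asarray(list(counts), dtype=float)
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        # 90th percentile, using NumPy's default linear interpolation.
        "p90": float(np.percentile(arr, 90)),
        "max": float(arr.max()),
    }


# Hypothetical counts, standing in for the real premise counts.
sample_counts = [1, 2, 2, 3, 12]
summary = summarize_counts(sample_counts)
```

A large gap between the mean and the median, as in this sample, would suggest a long right tail: a few data points with many premises.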