# Classification with Rubrix

If you're here, you probably already know that Rubrix is an annotation tool.

Rubrix has three parts: a server that hosts the annotation UI, an Elasticsearch instance (run in Docker here) that stores the data, and a Python library that talks to the server and registers data to be annotated.

To get to know Rubrix, we will annotate some English-language job advertisements. The job ads are stored in a tabular file. We will start the Elasticsearch container and the Rubrix server, then use Python to read our dataset and register it with the server.

## Running the server
Assuming you have Docker installed, you can start a Docker container with Elasticsearch directly from the notebook. You don't need to know much about Docker, but if it is new to you, remember to complete the "cleaning up" part at the bottom. For now, this should work.

In [12]:
%%bash
docker run -d \
    --name rubrix-es \
    -p 9200:9200 -p 9300:9300 \
    -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
    -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2

51ff13143913f46fb4abec2fc58183554ab5ae09cece7df66694b4fe4c14c269


In addition to Elasticsearch, you need to start the Rubrix server. This runs directly in Python (no Docker required) and can be started by running `python -m rubrix` on the command line. Because this command occupies the terminal, it is a good idea to run it in a new terminal.
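
Before moving on, it can help to confirm that both services respond. The following is a minimal sketch and not part of Rubrix itself: `is_reachable` is a hypothetical helper, and the URLs assume the default ports used in this notebook (9200 for Elasticsearch, as mapped in the Docker command above, and 6900 for the Rubrix UI).

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP request to `url` gets any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed URL.
        return False


# Assumed default addresses for this notebook's setup.
services = {
    "Elasticsearch": "http://localhost:9200",
    "Rubrix": "http://localhost:6900",
}

for name, url in services.items():
    state = "up" if is_reachable(url) else "not reachable"
    print(f"{name} ({url}): {state}")
```

Elasticsearch can take a little while to start, so a "not reachable" result right after `docker run` may simply mean it needs more time.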

## Loading the data

Now that the server is running, we can add the data we want to annotate. First, though, we use pandas to read the data from a file.

In [23]:
import pandas as pd
import rubrix as rb

df = pd.read_csv('../data/job_ads_english.csv', dtype='str')

In [24]:
df.head(10)

Unnamed: 0,job_title,job_description,isco_code,nace
0,Deck Bosun,# is overall responsible and in charge of all ...,8350,3111
1,Postdoctoral Research Fellow (ref. 2012/2293),# in #\n\nA three-years position as Post doc (...,2131,85421
2,Welder Offshore ( Sveiser offshore),# performs maintenance and fabrication welding...,7212,9101
3,Data collection Stavanger/Sandnes,\nWe need someone to do some data collection i...,4132,73200
4,Piping Engineer,IKM Consultants AS is a company within the IKM...,3117,71129
5,Post-Doctoral Research Fellowship in Cognitive...,Department of Psychology\n\n# in #\n\nBackgrou...,2310,85421
6,Medarbeidere (deltid),REMA 1000 i Tønsberg området vil vi få behov f...,5223,47111
7,Business Development Executive,Executive placement in DOF Subsea:\nSearching ...,2413,71122
8,Project Engineer (ref.nr. 404281),Responsibilities * Accountable to deliver resp...,3117,71129
9,Cost Controller TechnipFMC,#FMCTechnipFMC is a global leader in oil and g...,2411,30113


In [25]:
df_sample = df.sample(100)
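
Note that `df.sample(100)` draws a different random subset on every run. If you want to annotate the same ads again after restarting the notebook, pandas lets you fix the seed with `random_state`. A small sketch on a stand-in DataFrame (the `toy` frame and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Stand-in for the job ads table; only the sampling behaviour matters here.
toy = pd.DataFrame({"job_title": [f"job {i}" for i in range(1000)]})

# With a fixed random_state, the same rows are drawn every time.
first_draw = toy.sample(n=100, random_state=42)
second_draw = toy.sample(n=100, random_state=42)

print(first_draw.index.equals(second_draw.index))
```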

In [26]:
records = []

for i, r in df_sample.iterrows():
    record = rb.TextClassificationRecord(
        inputs={
            "title": r['job_title'],
            "text": r['job_description']
        }
    )

    records.append(record)
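
One thing to watch for: even with `dtype='str'`, empty cells in the CSV are read as missing values (`NaN`), not as strings. Whether Rubrix accepts such values as inputs is not something this notebook tests, so it may be safer to drop incomplete rows first. The sketch below uses a small made-up table (`df_demo`, with invented descriptions) and builds the `inputs` dictionaries in the same shape as the loop above, without calling Rubrix.

```python
import pandas as pd

# Made-up stand-in for the job ads table; the real data comes from the CSV.
df_demo = pd.DataFrame({
    "job_title": ["Deck Bosun", "Piping Engineer", None],
    "job_description": ["Responsible for deck operations.", None, "Some text."],
})

# Keep only rows where both text fields are present.
df_clean = df_demo.dropna(subset=["job_title", "job_description"])

# Build the `inputs` dictionaries, one per remaining row.
inputs_list = [
    {"title": row.job_title, "text": row.job_description}
    for row in df_clean.itertuples(index=False)
]

print(inputs_list)
```

Each dictionary in `inputs_list` could then be passed as the `inputs` argument of `rb.TextClassificationRecord`, as in the cell above.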

In [27]:
rb.log(records, name = "job_ads_example_1", verbose=False)


BulkResponse(dataset='job_ads_example_1', processed=100, failed=0)

When this is finished, you should find the data at http://localhost:6900. Note that modern browsers often switch to https automatically, so make sure you type http explicitly to avoid errors and confusion.

I'm not going to explain the user interface here, but you should be able to create a few labels and label the different job ads. If you need inspiration, try to label whether the jobs can be done from home. You will notice that some of the descriptions are uninformative or not in English, others are very technical and therefore hard to evaluate, and some jobs can be done partially from home. You might want to create labels for these as well.

When you are done, you can use Rubrix to retrieve the annotated data.

In [28]:
annotated_df = rb.load('job_ads_example_1', query='status:Validated')

In [29]:
len(annotated_df)

15

In [30]:
annotated_df.head()

Unnamed: 0,inputs,prediction,prediction_agent,annotation,annotation_agent,multi_label,explanation,id,metadata,status,event_timestamp,metrics
0,{'text': '– # – Powder - Norway # R&D Powder ...,,,hybrid,rubrix,False,,0067f2a0-9cfb-4b86-a0c3-be99b8100b2d,{},Validated,,{}
1,{'text': 'DNB Markets - # seeks: Analysts and...,,,unknown,rubrix,False,,030c4e1b-2a1c-42bf-af5e-f3bd362b7316,{},Validated,,{}
2,{'text': '00690 Responsibilities * Perform PD...,,,hybrid,rubrix,False,,044ff868-45af-45e0-9a7e-7a233294f948,{},Validated,,{}
3,{'text': 'Jobbnorge ID: 111500A scholarship is...,,,yes,rubrix,False,,0548f6e2-87ed-4512-911b-f1c5d7cbc768,{},Validated,,{}
4,{'text': 'Discipline : Instrument Project : S...,,,hybrid,rubrix,False,,05aae560-3fe0-45f7-a4a9-973a46de2a28,{},Validated,,{}
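
From here, ordinary pandas operations apply. The sketch below assumes, as in the output above, that `rb.load` returns a pandas DataFrame with an `inputs` column of dictionaries and an `annotation` column of labels. It uses a made-up stand-in (`annotated_demo`) so it runs without a server, counts how often each label was used, and flattens the `inputs` dictionaries into separate columns.

```python
import pandas as pd

# Made-up stand-in with the same column names as the loaded data.
annotated_demo = pd.DataFrame({
    "inputs": [
        {"title": "Job A", "text": "Text A"},
        {"title": "Job B", "text": "Text B"},
        {"title": "Job C", "text": "Text C"},
        {"title": "Job D", "text": "Text D"},
    ],
    "annotation": ["hybrid", "unknown", "hybrid", "yes"],
})

# How many records received each label.
label_counts = annotated_demo["annotation"].value_counts()
print(label_counts)

# Turn the `inputs` dictionaries into `title` and `text` columns,
# keeping the original index so the labels line up.
flat = pd.DataFrame(annotated_demo["inputs"].tolist(), index=annotated_demo.index)
flat["annotation"] = annotated_demo["annotation"]
print(flat)
```

A flat table like this is easier to export or to feed into a classifier than the nested `inputs` column.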
