# Random Baseline

- Date: 04.10.2023
- Maintainer: Jonathan Carona


## Change Log
- Date: 05.10.2023
- By: Josef Rittiner
- Log: Added code for submission to Kaggle

<b>Objective</b>: Develop a random baseline to predict the `content` and `wording` scores of student summaries.

## Imports

In [6]:
import pandas as pd
import numpy as np
import os

from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score, ShuffleSplit

In [7]:
# For submission
# path = '/kaggle/input/commonlit-evaluate-student-summaries'

# For testing locally
path = './kaggle/input/commonlit-evaluate-student-summaries'
if os.name == 'nt':
    # On Windows, prepend '.' so that './kaggle/...' becomes '../kaggle/...'
    path = f'.{path}'
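
The path selection above can also be done by probing which directory actually exists, so the same cell runs unchanged on Kaggle and locally. The following is a minimal sketch under that assumption; `resolve_data_dir` and `CANDIDATES` are hypothetical names and not part of the original notebook.

```python
from pathlib import Path

# Candidate locations, in order of preference (taken from the cell above).
CANDIDATES = [
    '/kaggle/input/commonlit-evaluate-student-summaries',    # Kaggle runtime
    './kaggle/input/commonlit-evaluate-student-summaries',   # local, relative to the notebook
    '../kaggle/input/commonlit-evaluate-student-summaries',  # local, one level up (Windows branch above)
]

def resolve_data_dir(candidates):
    """Hypothetical helper: return the first candidate that is an existing directory."""
    for candidate in candidates:
        candidate = Path(candidate)
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"None of the candidate data directories exist: {list(candidates)}")
```

With this helper, `path = resolve_data_dir(CANDIDATES)` would replace the `os.name` check; the f-strings used in the later cells keep working because a `Path` formats as its string form.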

In [8]:
summaries_df = pd.read_csv(f'{path}/summaries_train.csv')
summaries_df

| | student_id | prompt_id | text | content | wording |
|---|---|---|---|---|---|
| 0 | 8a31b8cc1996 | 3b9047 | In the social pyramid of ancient Egypt the pha... | -0.077267 | 0.424365 |
| 1 | 8c9411cfc953 | 39c16e | Aristotle claims that an ideal tragedy should ... | 0.559070 | -0.634924 |
| 2 | 4387107feb4d | 3b9047 | The ancient Egyptian system of government was ... | 1.376083 | 2.389443 |
| 3 | d720eb53c270 | ebad26 | They put pickle in them to mask the smell of r... | 0.297031 | -0.168734 |
| 4 | e887883b946c | ebad26 | "whenever meat was so spoiled that it could no... | -0.093814 | 0.503833 |
| ... | ... | ... | ... | ... | ... |
| 5727 | 63bb6f3ad628 | 39c16e | The ideal tragedy should be complex in plot. T... | -0.974242 | -0.751414 |
| 5728 | a341aed41e5a | ebad26 | In paragraph 2 the text states that "Jonas had... | 0.559070 | -0.634924 |
| 5729 | 52d42283cb1e | 814d6b | The third wave developed quickly because stude... | 1.344145 | 0.835238 |
| 5730 | 2e3df59b996d | 39c16e | An ideal tragedy should be complex and should ... | 0.873957 | 0.875453 |


In [9]:
prompts_df = pd.read_csv(f'{path}/prompts_train.csv')
prompts_df

| | prompt_id | prompt_question | prompt_title | prompt_text |
|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... |
| 1 | 3b9047 | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... |
| 2 | 814d6b | Summarize how the Third Wave developed over su... | The Third Wave | Background \r\nThe Third Wave experiment took ... |
| 3 | ebad26 | Summarize the various ways the factory would u... | Excerpt from The Jungle | With one member trimming beef in a cannery, an... |


## Implementing a random regressor

In [10]:
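# Attach the prompt columns to every summary. how='left' keeps the row order of
# summaries_df (the default inner merge may reorder rows), so X stays aligned with y.
X = summaries_df.merge(prompts_df, on='prompt_id', how='left').drop(columns=['prompt_id', 'wording', 'content'])
y = summaries_df[['content', 'wording']]

# DummyRegressor ignores X and always predicts the per-target median of y.
dummy_regr = DummyRegressor(strategy="median")
dummy_regr.fit(X, y)

# No `scoring` is passed, so each value is the regressor's default R^2 score
# on a random 10% hold-out split (ShuffleSplit's default test_size).
cv = ShuffleSplit(n_splits=5, random_state=42)
scores = cross_val_score(dummy_regr, X, y, cv=cv)
scores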

array([-0.00138463, -0.00335314, -0.00063743, -0.00027257, -0.00492638])
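
The values above are R² scores, which sit just below zero for any constant predictor and are therefore hard to compare with leaderboard results. Assuming the competition is scored with the mean columnwise RMSE (MCRMSE), the sketch below shows how that metric could be computed for a median baseline. The `mcrmse` function and the synthetic `y_demo` array are illustrative assumptions, not part of the original notebook, and the resulting number says nothing about the real data.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: one RMSE per target column, averaged over columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse_per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(np.mean(rmse_per_column))

# Synthetic stand-in for the (content, wording) targets; NOT the real scores.
rng = np.random.default_rng(42)
y_demo = rng.normal(size=(200, 2))

# Constant per-column median prediction, mirroring DummyRegressor(strategy="median").
median_pred = np.tile(np.median(y_demo, axis=0), (len(y_demo), 1))
baseline_score = mcrmse(y_demo, median_pred)
```

To use such a metric inside `cross_val_score`, it could be wrapped with `sklearn.metrics.make_scorer(mcrmse, greater_is_better=False)` and passed as `scoring`; the reported values are then negated errors, so values closer to zero are better.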

## Submission

In [11]:
test_prompts_df = pd.read_csv(f'{path}/prompts_test.csv')
test_summaries_df = pd.read_csv(f'{path}/summaries_test.csv')

In [12]:
X = test_summaries_df.merge(test_prompts_df, on='prompt_id').drop(columns=['prompt_id'])

In [13]:
predictions = dummy_regr.predict(X)
predictions_df = pd.DataFrame(predictions, columns=["content", "wording"])
predictions_df = pd.concat([X["student_id"], predictions_df], axis=1)
predictions_df.to_csv('submission.csv', index=False)
display(pd.read_csv('submission.csv'))

| | student_id | content | wording |
|---|---|---|---|
| 0 | 000000ffffff | -0.093814 | -0.081769 |
| 1 | 222222cccccc | -0.093814 | -0.081769 |
| 2 | 111111eeeeee | -0.093814 | -0.081769 |
| 3 | 333333dddddd | -0.093814 | -0.081769 |
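
Before uploading, it can be worth checking that the file has the layout shown above. The following is a minimal sketch of such a check; `check_submission` and `EXPECTED_COLUMNS` are hypothetical names, and the demo frame only reuses two rows from the output above.

```python
import pandas as pd

# Column layout taken from the submission output above.
EXPECTED_COLUMNS = ["student_id", "content", "wording"]

def check_submission(df, expected_ids):
    """Hypothetical sanity check; raises ValueError on the first problem found."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if df[["content", "wording"]].isna().any().any():
        raise ValueError("missing predictions")
    # Row order does not matter, but every test student must appear exactly once.
    if sorted(df["student_id"]) != sorted(expected_ids):
        raise ValueError("student_id values do not match the test set")
    return True

demo = pd.DataFrame({
    "student_id": ["000000ffffff", "222222cccccc"],
    "content": [-0.093814, -0.093814],
    "wording": [-0.081769, -0.081769],
})
demo_ok = check_submission(demo, ["222222cccccc", "000000ffffff"])
```

In the notebook this would be called as `check_submission(pd.read_csv('submission.csv'), test_summaries_df['student_id'])`.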
