In [1]:
import pandas as pd
import hands_on1 as ho1

# Hands-on-data #1

The goal of this activity is for you to practice designing and computing features on an educational data set. It is suggested that you complete the assignment using Excel, but you are free to use an equivalent tool if you desire. You will be asked to compute specific features and to develop your own.

For this assignment, use the `CognitiveTutorAlgebra-gaming-clips.csv` file.

This file contains a detailed trace of all the interactions that students have within an intelligent tutor called Cognitive Tutor Algebra. In this tutor, students need to complete multiple steps in order to solve a problem. Each action is a row in the log file. Each action is one of 4 types: HELP, RIGHT, WRONG, BUG (see the "assessment" field).

- HELP is when the student requests a hint.
- RIGHT is when the student correctly completes one of the steps.
- BUG is when the student's action is incorrect but is a known error (a bug in the student's problem solving process). In such a case, the student is presented with a message explaining the error.
- WRONG is when the student's action is incorrect and that action is not a known error (there is no error message associated with it).

Regarding the "action" field: this dataset contains different kinds of problems. Most of them just involve filling in tables or text fields. In those cases, the "action" field is blank.

There are some problems where the student has an equation and needs to isolate a term; for example, 2*x = 8. In that case, the right thing to do would be to divide both sides of the equation by 2. When the student does that, the "action" field should be 'divide' and "input" should be '2'.

The "context" field is the element in the user interface that the student is interacting with. For example, you'll see a lot of cases where the context is something of the form 'R1C2'. This happens when the student interacts with a table and fills in one of the cells: in this example, row 1 and column 2.

You'll also see cases where the context is the equation the student is manipulating.

When you look at the "time" column, you'll notice that it sometimes has the value '-1'. This happens when the action is at the start of a problem or in some other similar situation where it’s not possible to look at the previous action. Note that the dataset you have does not include information about the problem id.

Finally, the assignment asks you to use 5-action clips as the grain size. The last column of the data set is called “Gaming clip”. In this dataset, clips of 5 actions have been randomly selected to identify whether students appear to be *gaming the system during those clips.

*Gaming the system is a disengaged behavior in which students abuse the help functionalities of the tutor to complete problems without having to understand how to solve them. Examples of gaming behavior include systematically guessing the answers and abusing help requests.

This column is mostly blank. A blank indicates that the action is not part of one of the randomly selected clips of 5 actions; that is, the action is not labeled yet. Otherwise, you will notice a number. This number is the unique id of one of the selected clips. Each line associated with that number is one of the actions in that clip of 5 actions.

You will need to compute features that are aggregated at the clip level (i.e., 5-action clip). This means that you will need to compute the value of the feature within this 5-action clip. The resulting data table should have 1 row per clip, and each additional column should be one of the computed features.
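
The clip-level aggregation described above can be sketched with a pandas `groupby`. This is only an illustration on made-up rows, not the assignment's reference solution: the toy clips are shorter than 5 actions, and the output column names `n_actions` and `total_time` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; the real clips contain 5 actions each.
toy = pd.DataFrame({
    "time": [3, 5, 4, 2, 6, 7],
    "assessment": ["RIGHT", "WRONG", "HELP", "RIGHT", "RIGHT", "BUG"],
    "Gaming clip": [1, 1, 1, 2, 2, None],  # None = action not in a labeled clip
})

# Keep only the actions that belong to a labeled clip.
labeled = toy.dropna(subset=["Gaming clip"])

# One row per clip, one column per aggregated feature.
clip_table = labeled.groupby("Gaming clip").agg(
    n_actions=("assessment", "count"),
    total_time=("time", "sum"),
)
print(clip_table)
```

Every feature in the questions below follows the same pattern: filter to labeled rows, group by the clip id, then reduce each group to a single value.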

Address the following questions:

## Question 1 (3 points): Compute each of the following 3 features:
- AvgTimePerAction: The average time, in seconds, spent on each action included in the clip.
- #ContextChanges: The number of times, across the actions of the clip, that the context (element of the user interface) in which an action took place was different from that of the previous action.
- #CorrectActions: The number of times the student executed a correct action within the clip.

## Question 2 (2 points): Develop and compute 2 features that might be useful when trying to identify the gaming the system behavior. Provide a meaningful name and a description for both features you created.

Then, submit the results as follows:

- Click 'Reply' in the text box below.
- Write a description for the features you created for Question 2.
- Attach the Excel file(s) that include traces of your computation (e.g., you will need to save your files as .xlsx to keep all the formulas and pivot tables you added in Excel).

In [2]:
data=pd.read_csv('./data/CognitiveTutorAlgebra-gaming-clips.csv')

# Question 1

## AvgTimePerAction

The average time, in seconds, spent on each action included in the clip.
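
The `ho1` module is not shown in this notebook, so the following is only a guess at what such a helper might look like. It assumes that the `-1` placeholder in `time` should be treated as missing rather than averaged in; that choice is an assumption, not something the assignment states. The function name is hypothetical.

```python
import pandas as pd

def avg_time_per_action_sketch(df: pd.DataFrame) -> pd.Series:
    """Mean `time` per clip, skipping the -1 placeholder (assumed to mean 'unknown')."""
    labeled = df.dropna(subset=["Gaming clip"])
    # mask() turns every -1 into NaN, and mean() ignores NaN by default.
    valid_time = labeled["time"].mask(labeled["time"] == -1)
    return (
        valid_time.groupby(labeled["Gaming clip"])
        .mean()
        .rename("AvgTimePerAction")
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "time": [4, -1, 6, 2, 2, 9],
    "Gaming clip": [1, 1, 1, 2, 2, None],
})
avg_time = avg_time_per_action_sketch(toy)
print(avg_time)
```

If the `-1` rows were kept instead, they would pull the average down and could make a clip look faster than it really was.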

In [3]:
ho1.get_avg_time_per_action(data).to_csv('./data/ho1_avg_time.csv')

## ContextChanges

The number of times, across the actions of the clip, that the context (element of the user interface) in which an action took place was different from that of the previous action.
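
One possible reading of this definition, sketched below, compares each action only with the previous action inside the same clip, so the first action of a clip never counts as a change. Comparing against the action that precedes the clip in the full log is an equally defensible reading. The function name is hypothetical and the column name `context` is assumed from the field description above.

```python
import pandas as pd

def context_changes_sketch(df: pd.DataFrame) -> pd.Series:
    """Count within-clip context switches (first action of each clip is skipped)."""
    labeled = df.dropna(subset=["Gaming clip"])

    def count_changes(contexts: pd.Series) -> int:
        # shift() lines each context up with the previous one in the clip;
        # iloc[1:] drops the first action, which has no predecessor here.
        return int((contexts != contexts.shift()).iloc[1:].sum())

    return (
        labeled.groupby("Gaming clip")["context"]
        .apply(count_changes)
        .rename("#ContextChanges")
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "context": ["R1C1", "R1C1", "R1C2", "R1C2", "R2C1", "x+2=4", "x+2=4"],
    "Gaming clip": [1, 1, 1, 1, 1, 2, 2],
})
changes = context_changes_sketch(toy)
print(changes)
```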

In [4]:
ho1.context_changes(data).to_csv('./data/ho1_context_changes.csv')

## CorrectActions

The number of times the student executed a correct action within the clip.
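
A minimal sketch of this count, assuming that "correct" means `assessment == "RIGHT"`. The function name is hypothetical. Taking an explicit `.copy()` of the filtered frame before adding a column is one common way to avoid pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

def correct_actions_sketch(df: pd.DataFrame) -> pd.Series:
    """Number of RIGHT assessments per clip."""
    # .copy() makes the filtered rows an independent DataFrame, so the
    # column assignment below does not write into a slice of `df`.
    labeled = df.dropna(subset=["Gaming clip"]).copy()
    labeled["is_right"] = (labeled["assessment"] == "RIGHT").astype(int)
    return (
        labeled.groupby("Gaming clip")["is_right"]
        .sum()
        .rename("#CorrectActions")
    )

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "assessment": ["RIGHT", "WRONG", "RIGHT", "HELP", "RIGHT", "BUG", "BUG"],
    "Gaming clip": [1, 1, 1, 1, 1, 2, 2],
})
correct = correct_actions_sketch(toy)
print(correct)
```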

In [5]:
ho1.correct_actions(data).to_csv('./data/ho1_correct_actions.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gaming_data['#CorrectActions'] = (


# Question 2

## Time Features

### TotalTimeToSolve

How long each student took to solve an assignment.

A quick solving time could indicate gaming behavior.

### AverageSolveTime

Average time students took to solve each assignment.

This is used in the next calculation.

### DeviationFromMeanTime

`TotalTimeToSolve - AverageSolveTime`

`TotalTimeToSolve` on its own doesn't necessarily indicate whether a student is solving fast, so we calculate the signed difference between the student's solve time and the average solve time for each lesson. A negative value means the student was faster than average.
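
The three time features can be sketched in one pass with `groupby(...).transform("mean")`, which returns floats and so sidesteps assigning float values into an integer column. This is a sketch under assumptions: the student identifier column is called `student` here (hypothetical; the real column name is not shown in this notebook), the function name is hypothetical, and `-1` times are counted as 0 seconds.

```python
import pandas as pd

def deviation_from_mean_time_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Total time per (student, lesson) and its deviation from the lesson mean."""
    # Assumption: -1 means 'no time available', so count it as 0 seconds.
    clean = df.assign(time=df["time"].clip(lower=0))
    tts = (
        clean.groupby(["student", "lesson"], as_index=False)["time"]
        .sum()
        .rename(columns={"time": "TotalTimeToSolve"})
    )
    # transform("mean") broadcasts each lesson's mean back onto its rows.
    tts["AverageSolveTime"] = (
        tts.groupby("lesson")["TotalTimeToSolve"].transform("mean")
    )
    tts["DeviationFromMeanTime"] = (
        tts["TotalTimeToSolve"] - tts["AverageSolveTime"]
    )
    return tts

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "student": ["a", "a", "b", "b", "a"],
    "lesson": ["L1", "L1", "L1", "L1", "L2"],
    "time": [10, 20, -1, 50, 5],
})
result = deviation_from_mean_time_sketch(toy)
print(result)
```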

In [6]:
ho1.total_time_to_solve(data).to_csv('./data/ho1_TotalTimeToSolve.csv')

In [7]:
ho1.average_solve_time(data).to_csv('./data/ho1_AverageSolveTime.csv')

In [8]:
ho1.deviation_mean_solve_time(data).to_csv('./data/ho1_DeviationFromMeanTime.csv')

 -3724.44444444 -2197.44444444  4585.55555556  3760.55555556
  1409.55555556 -7108.44444444  1252.55555556  -974.44444444
 -7278.44444444   362.55555556  5002.55555556 -4931.44444444
  1755.55555556 -1650.44444444  4731.55555556 -2987.44444444
 -4158.44444444  4801.55555556 -4854.44444444  4613.55555556
  5139.55555556  1455.55555556  5051.55555556  1273.55555556
   340.55555556 -4038.44444444  4874.55555556 -6221.44444444
   647.55555556 -1704.44444444   762.55555556  2692.55555556]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  tts.loc[tts['lesson'] == lesson, 'DeviationFromMean'] =\
  tts.loc[tts['lesson'] == lesson, 'AverageSolveTime'] =\


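## Action Features

### Wrong Count

Number of `WRONG` and `BUG` assessments per action.

### Correct Count

Number of `RIGHT` assessments per action.

### Incorrect Ratio

Share of incorrect attempts per action: `Wrong Count / (Wrong Count + Correct Count)`. It is stored in the `Incorrect Percent` column and ranges from 0 to 1.

In [23]:
# Work on a copy so the original log stays untouched
c=data.copy()
c['Wrong Count'] = 0
c['Correct Count'] = 0
c['Help Count'] = 0

In [25]:
# Flag each row with a 1 in the counter that matches its assessment
c.loc[c['assessment']=='WRONG', 'Wrong Count'] = 1
c.loc[c['assessment']=='BUG', 'Wrong Count'] = 1
c.loc[c['assessment']=='HELP', 'Help Count'] = 1
c.loc[c['assessment']=='RIGHT', 'Correct Count'] = 1

In [19]:
# Sum the flags per action type
c=c.groupby('action').sum(numeric_only=True)
# Fraction of graded attempts that were incorrect (matches the values in the output below)
c['Incorrect Percent'] = c['Wrong Count'] / (c['Wrong Count'] + c['Correct Count'])
c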

| action | Row ID | time | Gaming clip | Wrong Count | Correct Count | Help Count | Incorrect Percent |
|---|---|---|---|---|---|---|---|
| BLANK | 56968136950 | 3923639 | 9750438000.0 | 79331 | 175651 | 34588 | 0.311124 |
| add | 1694007042 | 146671 | 155890900.0 | 2593 | 5916 | 0 | 0.304736 |
| aproot | 117805009 | 6396 | 1925014.0 | 144 | 623 | 0 | 0.187744 |
| clt | 3047890309 | 133046 | 227780100.0 | 2205 | 10862 | 0 | 0.168746 |
| distribute | 546740237 | 59680 | 50697640.0 | 380 | 2652 | 0 | 0.12533 |
| divide | 3727537537 | 239753 | 435330400.0 | 3299 | 15467 | 0 | 0.175797 |
| expon | 119670666 | 8984 | 2644196.0 | 119 | 619 | 0 | 0.161247 |
| fact | 279998182 | 32077 | 0.0 | 380 | 1211 | 0 | 0.238843 |
| fq | 112875058 | 18811 | 0.0 | 196 | 589 | 0 | 0.249682 |
| ivm | 7899779 | 567 | 234859.0 | 22 | 8 | 0 | 0.733333 |
