# Notebook 1 - Grade Normalization

### Each essay set has a different resolved score range:

- Essay set 1: 2 - 12
- Essay set 2: 1 - 6
- Essay set 3: 0 - 3
- Essay set 4: 0 - 3
- Essay set 5: 0 - 4
- Essay set 6: 0 - 4
- Essay set 7: 0 - 30
- Essay set 8: 0 - 60


In [1]:
# Import pandas and numpy data structures to handle the data.
import pandas as pd
import numpy as np

### Loading the data:

In [2]:
df = pd.read_csv('essay_scoring.csv', encoding='latin-1')
# Keep only rows with a finite domain1_score (drops NaN and infinite values).
df = df[np.isfinite(df['domain1_score'])]

In [3]:
df.head()

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,domain1_score
0,1,1,"Dear local newspaper, I think effects computer...",4.0,4.0,8.0
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5.0,4.0,9.0
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4.0,3.0,7.0
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5.0,5.0,10.0
4,5,1,"Dear @LOCATION1, I know having computers has a...",4.0,4.0,8.0


### Remove unwanted columns from the dataframe:

In [4]:
# Pass the labels via `columns=`; the positional axis argument (`drop(label, 1)`) was removed in pandas 2.0.
df = df.drop(columns=['essay_id', 'rater1_domain1', 'rater2_domain1'])

### Creating a separate dataframe for each essay set:

In [5]:
# .copy() gives each subset its own data, so modifying it later does not
# raise SettingWithCopyWarning or touch the original dataframe.
df_1 = df[df['essay_set'] == 1].copy()
df_2 = df[df['essay_set'] == 2].copy()
df_3 = df[df['essay_set'] == 3].copy()
df_4 = df[df['essay_set'] == 4].copy()
df_5 = df[df['essay_set'] == 5].copy()
df_6 = df[df['essay_set'] == 6].copy()
df_7 = df[df['essay_set'] == 7].copy()
df_8 = df[df['essay_set'] == 8].copy()

In [6]:
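# Min-max scale the scores of one essay set so that every set ends up on a common 0 - 10 scale.
# The dataframe's 'domain1_score' column is modified in place.
def normalize(df, min_score, max_score):
    # Parameter names avoid shadowing the built-ins min() and max().
    a = 1.0 / (max_score - min_score)
    b = float(-min_score) / (max_score - min_score)

    df['domain1_score'] *= a
    df['domain1_score'] += b
    df['domain1_score'] *= 10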
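
Written out, the three in-place updates in `normalize` amount to standard min-max scaling stretched to a 0 - 10 range. The symbols below are introduced here for illustration only: $s$ is the raw score, $s'$ the normalized score, and $s_{\min}$, $s_{\max}$ the bounds of the essay set.

$$
s' = 10\,(a\,s + b), \qquad a = \frac{1}{s_{\max} - s_{\min}}, \qquad b = \frac{-s_{\min}}{s_{\max} - s_{\min}}
$$

$$
s' = 10 \cdot \frac{s - s_{\min}}{s_{\max} - s_{\min}}
$$

For example, a raw score of 8 in essay set 1 (range 2 - 12) gives $10 \cdot (8 - 2) / (12 - 2) = 6.0$, which agrees with the first row of the normalized output shown later in this notebook.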

### Normalizing the domain1_score for each essay set:

In [7]:
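# Silence warnings such as pandas' SettingWithCopyWarning, which normalize()
# can trigger when it modifies a filtered dataframe in place.
import warnings
warnings.filterwarnings('ignore')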

normalize(df_1, 2, 12)
normalize(df_2, 1, 6)
normalize(df_3, 0, 3)
normalize(df_4, 0, 3)
normalize(df_5, 0, 4)
normalize(df_6, 0, 4)
normalize(df_7, 0, 30)
normalize(df_8, 0, 60)
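
The eight calls above could also be expressed as a single pass over the unsplit dataframe. The following is only a sketch of that alternative, not the approach this notebook takes: the names `score_ranges` and `normalize_all` are hypothetical, and the `toy` rows are made up for illustration rather than taken from `essay_scoring.csv`.

```python
import pandas as pd

# Score ranges per essay set, as listed at the top of this notebook.
score_ranges = {1: (2, 12), 2: (1, 6), 3: (0, 3), 4: (0, 3),
                5: (0, 4), 6: (0, 4), 7: (0, 30), 8: (0, 60)}

def normalize_all(frame, ranges):
    """Return a copy of `frame` with domain1_score min-max scaled to 0 - 10 per essay set."""
    # Look up each row's lower and upper bound from its essay_set value.
    low = frame['essay_set'].map(lambda s: ranges[s][0])
    high = frame['essay_set'].map(lambda s: ranges[s][1])
    out = frame.copy()
    out['domain1_score'] = 10.0 * (frame['domain1_score'] - low) / (high - low)
    return out

# Toy rows for illustration only (hypothetical, not from the dataset).
toy = pd.DataFrame({'essay_set': [1, 2, 8], 'domain1_score': [7.0, 6.0, 30.0]})
print(normalize_all(toy, score_ranges))
```

Because every row keeps its position, this variant would not need the per-set dataframes or a later recombination step, and it leaves the input frame unchanged.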

In [8]:
df_8.head()

Unnamed: 0,essay_set,essay,domain1_score
12255,8,A long time ago when I was in third grade I h...,5.666667
12256,8,Softball has to be one of the single most gre...,7.666667
12257,8,"Some people like making people laugh, I love ...",6.666667
12258,8,"""LAUGHTER"" @CAPS1 I hang out with my friends...",5.0
12259,8,Well ima tell a story about the time i got @CA...,4.333333


### Combining the dataframes:

In [9]:
x1 = np.around(df_1['domain1_score'].values, decimals = 2).tolist()
x2 = np.around(df_2['domain1_score'].values, decimals = 2).tolist()
x3 = np.around(df_3['domain1_score'].values, decimals = 2).tolist()
x4 = np.around(df_4['domain1_score'].values, decimals = 2).tolist()
x5 = np.around(df_5['domain1_score'].values, decimals = 2).tolist()
x6 = np.around(df_6['domain1_score'].values, decimals = 2).tolist()
x7 = np.around(df_7['domain1_score'].values, decimals = 2).tolist()
x8 = np.around(df_8['domain1_score'].values, decimals = 2).tolist()

In [10]:
x = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

In [11]:
domain1_score_df = pd.DataFrame(
    {
    'domain1_score': x
    })

In [12]:
domain1_score_df.shape

(12977, 1)

In [13]:
domain1_score_df.head()

Unnamed: 0,domain1_score
0,6.0
1,7.0
2,5.0
3,8.0
4,6.0


In [14]:
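# Drop the raw scores; the positional axis argument (`drop(label, 1)`) was removed in pandas 2.0.
df = df.drop(columns='domain1_score')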

In [15]:
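# Positional assignment: this is only correct if the rows of df are already
# ordered by essay_set (1 to 8), matching the concatenation order used above.
df['domain1_score'] = domain1_score_df['domain1_score'].values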
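
Rebuilding the score column from concatenated lists depends on the rows of `df` being grouped by essay set in ascending order. A possible index-preserving alternative is sketched below with `pd.concat`; the frames `part_1` and `part_2` are hypothetical stand-ins for the per-set dataframes, with made-up values and deliberately interleaved row labels.

```python
import pandas as pd

# Toy stand-ins for the per-set dataframes (hypothetical values).
# The index labels mimic a source frame that is NOT sorted by essay_set.
part_1 = pd.DataFrame({'essay_set': [1, 1], 'domain1_score': [6.0, 7.0]}, index=[0, 2])
part_2 = pd.DataFrame({'essay_set': [2], 'domain1_score': [10.0]}, index=[1])

# concat keeps each row's original index label; sort_index restores the original row order.
combined = pd.concat([part_1, part_2]).sort_index()

# Round to two decimals, as the list-based cells do with np.around.
combined['domain1_score'] = combined['domain1_score'].round(2)
print(combined)
```

Since each score travels with its own row label, the result stays aligned with the essays regardless of how the source file is ordered.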

In [16]:
df.head()

Unnamed: 0,essay_set,essay,domain1_score
0,1,"Dear local newspaper, I think effects computer...",6.0
1,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",7.0
2,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",5.0
3,1,"Dear Local Newspaper, @CAPS1 I have found that...",8.0
4,1,"Dear @LOCATION1, I know having computers has a...",6.0


In [17]:
# Checking for null values in the dataset.
print("Nulls in dataset: {0} => {1}".format(df.columns.values, df.isnull().any().values))

Nulls in dataset: ['essay_set' 'essay' 'domain1_score'] => [False False False]


### Saving the normalized scores into a file

In [18]:
df.to_csv('data.csv')
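
Called without `index=False`, `to_csv` also writes the dataframe index as an unnamed first column. The sketch below shows how that column might be handled when the file is read back; it uses an in-memory buffer and a hypothetical `toy` frame instead of `data.csv`, so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the normalized dataframe.
toy = pd.DataFrame({'essay_set': [1, 8], 'domain1_score': [6.0, 5.5]})

buffer = io.StringIO()
toy.to_csv(buffer)            # same call style as above: the index becomes an unnamed first column
csv_text = buffer.getvalue()

# Reading without index_col keeps that column as ordinary data named 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 turns the first column back into the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

A later notebook that loads `data.csv` can pass `index_col=0` to avoid carrying the extra column along.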