# Data Cleanup
in Predicting Student Admissions with Neural Networks

In this notebook, we predict student admissions to graduate school at UCLA based on three pieces of data:
- GRE Scores (Test)
- GPA Scores (Grades)
- Class rank (1-4)

The dataset originally came from here: http://www.ats.ucla.edu/

## Loading the data
To load and format the data, we will use two packages: pandas and NumPy.

In [29]:
# Importing pandas and numpy
import pandas as pd
import numpy as np

# Reading the csv file into a pandas DataFrame
data = pd.read_csv('A2_student_data.csv')

# Printing out the first 10 rows of our data
data[:10]

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4
5,1,760,3.0,2
6,1,560,2.98,1
7,0,400,3.08,2
8,1,540,3.39,3
9,0,700,3.92,2


The `rank` column holds category labels (1-4) rather than a measured quantity, so we should one-hot encode it.

## One-hot encoding the rank
Use the pandas `get_dummies` function to one-hot encode the `rank` column.

Hint: To drop a column, it's suggested that you use `one_hot_data`[.drop( )](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html).
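
As a minimal sketch of what one-hot encoding produces, the toy frame below (made-up values, not rows from the dataset) is passed through `get_dummies`. The `dtype=int` argument is an assumption about the desired output: pandas 2.0 and later return boolean columns by default, and `dtype=int` yields 0/1 columns like the ones shown in this notebook.

```python
import pandas as pd

# Toy data for illustration only (not taken from the admissions dataset)
toy = pd.DataFrame({'gre': [500, 700, 600], 'rank': [3, 1, 3]})

# One indicator column per distinct rank value; the original 'rank' column is replaced
toy_one_hot = pd.get_dummies(toy, columns=['rank'], dtype=int)

print(toy_one_hot)
```

Only rank values that actually occur get a column, so this toy frame has no `rank_2` or `rank_4`.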

In [37]:
data[:5]

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [38]:
# TODO:  
# (1) Make dummy variables for rank 
# (2) and concat existing columns
# (3) Drop the previous rank column
# Print the first 10 rows of our data

In [39]:
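##### Up-to-date version of pandas #####
# (1) Make dummy variables for rank, (2) concat existing columns, and
# (3) drop the previous rank column, all in one call.
# dtype=int keeps the dummies as 0/1 (pandas >= 2.0 returns booleans by default)
one_hot_data = pd.get_dummies(data, columns=['rank'], dtype=int)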

# Print the first 10 rows of our data
one_hot_data[:10]

Unnamed: 0,admit,gre,gpa,rank_1,rank_2,rank_3,rank_4
0,0,380,3.61,0,0,1,0
1,1,660,3.67,0,0,1,0
2,1,800,4.0,1,0,0,0
3,1,640,3.19,0,0,0,1
4,0,520,2.93,0,0,0,1
5,1,760,3.0,0,1,0,0
6,1,560,2.98,1,0,0,0
7,0,400,3.08,0,1,0,0
8,1,540,3.39,0,0,1,0
9,0,700,3.92,0,1,0,0


In [40]:
##### Alternatively, #####

In [41]:
# (1) Make dummy variables for rank
# dtype=int keeps the dummies as 0/1 (pandas >= 2.0 returns booleans by default)
dummies = pd.get_dummies(data['rank'], prefix='rank', dtype=int)
dummies[:10]

Unnamed: 0,rank_1,rank_2,rank_3,rank_4
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
5,0,1,0,0
6,1,0,0,0
7,0,1,0,0
8,0,0,1,0
9,0,1,0,0


In [42]:
# (2) Concatenate the dummies with the existing columns
one_hot_data = pd.concat([data, dummies], axis=1)
one_hot_data[:10]

Unnamed: 0,admit,gre,gpa,rank,rank_1,rank_2,rank_3,rank_4
0,0,380,3.61,3,0,0,1,0
1,1,660,3.67,3,0,0,1,0
2,1,800,4.0,1,1,0,0,0
3,1,640,3.19,4,0,0,0,1
4,0,520,2.93,4,0,0,0,1
5,1,760,3.0,2,0,1,0,0
6,1,560,2.98,1,1,0,0,0
7,0,400,3.08,2,0,1,0,0
8,1,540,3.39,3,0,0,1,0
9,0,700,3.92,2,0,1,0,0


In [43]:
# (3) Drop the previous rank column
one_hot_data = one_hot_data.drop('rank', axis = 1)

# Print the first 10 rows of our data
one_hot_data[:10]

Unnamed: 0,admit,gre,gpa,rank_1,rank_2,rank_3,rank_4
0,0,380,3.61,0,0,1,0
1,1,660,3.67,0,0,1,0
2,1,800,4.0,1,0,0,0
3,1,640,3.19,0,0,0,1
4,0,520,2.93,0,0,0,1
5,1,760,3.0,0,1,0,0
6,1,560,2.98,1,0,0,0
7,0,400,3.08,0,1,0,0
8,1,540,3.39,0,0,1,0
9,0,700,3.92,0,1,0,0


## Scaling the data
The next step is to scale the data. The grades range from 1.0 to 4.0, whereas the test scores range from roughly 200 to 800, which is much larger. Features on such different scales are hard for a neural network to handle, because the larger-valued feature tends to dominate the weight updates. Let's fit both features into the range 0-1 by dividing the grades by 4.0 and the test scores by 800.
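
The division described above can be sketched on a toy frame (made-up values, not rows from the dataset). The second option, z-score standardization, is the alternative noted in this notebook's scaling cell; the helper name `z_score` is hypothetical, and it uses the pandas default sample standard deviation.

```python
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({'gre': [400.0, 800.0, 600.0], 'gpa': [3.0, 4.0, 2.0]})

# Option 1: divide each feature by its maximum possible value
max_scaled = toy.copy()
max_scaled['gre'] = max_scaled['gre'] / 800
max_scaled['gpa'] = max_scaled['gpa'] / 4.0

# Option 2 (hypothetical helper): subtract the mean, divide by the standard deviation
def z_score(column):
    return (column - column.mean()) / column.std()

z_scaled = toy.apply(z_score)

print(max_scaled)
print(z_scaled)
```

Dividing by a fixed maximum keeps values in 0-1, while z-scores are centered on 0 and can be negative.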

In [44]:
# Making a copy of our data (.copy() gives an independent DataFrame,
# so scaling below does not touch one_hot_data)
processed_data = one_hot_data.copy()

In [45]:
# TODO: Scale the columns (z-score standardization is an alternative)
processed_data['gre'] = processed_data['gre']/800
processed_data['gpa'] = processed_data['gpa']/4.0

# Printing the first 10 rows of our processed data
processed_data[:10]

Unnamed: 0,admit,gre,gpa,rank_1,rank_2,rank_3,rank_4
0,0,0.475,0.9025,0,0,1,0
1,1,0.825,0.9175,0,0,1,0
2,1,1.0,1.0,1,0,0,0
3,1,0.8,0.7975,0,0,0,1
4,0,0.65,0.7325,0,0,0,1
5,1,0.95,0.75,0,1,0,0
6,1,0.7,0.745,1,0,0,0
7,0,0.5,0.77,0,1,0,0
8,1,0.675,0.8475,0,0,1,0
9,0,0.875,0.98,0,1,0,0


## Splitting the data into Training and Testing

In order to test our algorithm, we'll split the data into a Training and a Testing set. The size of the testing set will be 10% of the total data.
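
A minimal sketch of a 90/10 split on a toy frame follows. The toy data, the seed value, and the use of NumPy's `default_rng` generator are assumptions made for illustration; they are not the notebook's own choices.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows, for illustration only
toy = pd.DataFrame({'x': np.arange(10), 'admit': [0, 1] * 5})

# A seeded generator makes the split reproducible (the seed 42 is arbitrary)
rng = np.random.default_rng(42)

# Draw 90% of the index labels without replacement
train_labels = rng.choice(toy.index.to_numpy(), size=int(len(toy) * 0.9), replace=False)

toy_train = toy.loc[train_labels]   # label-based selection of the sampled rows
toy_test = toy.drop(train_labels)   # every row that was not sampled

print(len(toy_train), len(toy_test))
```

Sampling without replacement ensures that no row appears in both sets.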

In [55]:
# np.random.seed(42) # in case we want to fix random seed

sample = np.random.choice(processed_data.index, 
                          size=int(len(processed_data)*0.9), 
                          replace=False)

# sample holds index labels, so select with .loc (label-based) to match .drop(sample)
train_data = processed_data.loc[sample]
test_data = processed_data.drop(sample)

In [57]:
print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))

print(train_data[:10])
print(test_data[:10])

Number of training samples is 360
Number of testing samples is 40
     admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
172      0  0.850  0.8700       0       0       1       0
137      0  0.875  1.0000       0       0       1       0
126      1  0.750  0.8850       1       0       0       0
94       1  0.825  0.8600       0       1       0       0
72       0  0.600  0.8475       0       0       0       1
33       1  1.000  1.0000       0       0       1       0
380      0  0.875  0.9125       0       1       0       0
223      0  1.000  0.8675       0       0       1       0
307      0  0.725  0.8775       0       1       0       0
227      0  0.675  0.7550       0       0       0       1
     admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
20       0  0.625  0.7925       0       0       1       0
21       1  0.825  0.9075       0       1       0       0
48       0  0.550  0.6200       0       0       0       1
50       0  0.800  0.9650       0       0       1       0
54    

## Splitting the data into features and targets (labels)
Now, as a final step before the training, we'll split the data into features (X) and targets (y).
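
As a rough sketch, the same split can be wrapped in a small helper that also converts the result to NumPy arrays. The helper name `split_features_targets` is hypothetical, and the conversion to arrays is an assumption about what the later training code expects.

```python
import pandas as pd

def split_features_targets(df, target_col='admit'):
    """Return (X, y) as NumPy arrays from a DataFrame with one target column."""
    X = df.drop(target_col, axis=1).to_numpy(dtype=float)  # all columns except the target
    y = df[target_col].to_numpy()                          # the target column only
    return X, y

# Toy data for illustration only
toy = pd.DataFrame({'admit': [0, 1], 'gre': [0.5, 1.0], 'gpa': [0.75, 1.0]})
X, y = split_features_targets(toy)

print(X.shape, y.shape)
```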

In [47]:
# Create 'features' by dropping the admit column
# Create 'targets' with only the admit column

features = train_data.drop('admit', axis=1)       # gre, gpa, rank information
targets = train_data['admit']                     # admitted or not (the value we want to predict)

features_test = test_data.drop('admit', axis=1)
targets_test = test_data['admit']

In [48]:
print(features[:10])

       gre     gpa  rank_1  rank_2  rank_3  rank_4
172  0.850  0.8700       0       0       1       0
137  0.875  1.0000       0       0       1       0
126  0.750  0.8850       1       0       0       0
94   0.825  0.8600       0       1       0       0
72   0.600  0.8475       0       0       0       1
33   1.000  1.0000       0       0       1       0
380  0.875  0.9125       0       1       0       0
223  1.000  0.8675       0       0       1       0
307  0.725  0.8775       0       1       0       0
227  0.675  0.7550       0       0       0       1


In [49]:
print(targets[:10])

172    0
137    0
126    1
94     1
72     0
33     1
380    0
223    0
307    0
227    0
Name: admit, dtype: int64
