## Introduction to Dataset Processing
#### Carl Shan

This Jupyter Notebook shares more details about how to process your data. Data processing is like preparing the ingredients before cooking: if you prepare them poorly (e.g., leave things half-peeled and dirty), the meal will taste poor no matter how skillful a chef you are.

The same is true in machine learning: dataset processing can be one of the most important things you can do to get your model to perform well.

#### Introducing some helpful "magic" Jupyter commands
`?` - appending this to a name in a notebook cell (e.g., `pd.read_csv?`) brings up the documentation of that function
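The `?` syntax only works inside IPython and Jupyter. As a small illustration that is not part of the original notebook, plain Python can fetch the same docstring with the standard-library `inspect.getdoc()` (or the built-in `help()`). The choice of `len` here is just an example:

```python
import inspect

# In a Jupyter cell you could write `len?` instead; that form is IPython-only.
doc = inspect.getdoc(len)  # the docstring as a plain string
print(doc)
```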

In [4]:
import pandas as pd
from sklearn import preprocessing

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [5]:
data = pd.read_csv('student/student-mat.csv', sep=';')

In [6]:
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


#### Converting Categorical Values to Numerical Ones

Looking at the data above, we want to convert a number of the columns from categorical to numerical. Most machine learning models work with numbers and don't know how to model data that is in text form. As a result, we need to learn how to do things such as converting the values in the `school` column to numbers.

#### First, let's see what values there are in the `school` column

In [7]:
# This shows a list of unique values and how many times they appear
data['school'].value_counts()

GP    349
MS     46
Name: school, dtype: int64

In [27]:
# Converting the text values in the school column to numbers
# We define a function that converts a single value, then apply it to every value in the column
def convert_school(row):
    if row == 'GP':
        return 0
    elif row == 'MS':
        return 1
    else:
        return None

In [26]:
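# Here's a slow way of using the above function
# Note: a bare `%time` line times only that (empty) line, not the rest of the cell,
# which is why the output below reports just a few microseconds.
# To time the whole cell, put `%%time` on the first line of the cell instead.
%time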
converted_data = []

for row in data['school']:
    new_value = convert_school(row)
    converted_data.append(new_value)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
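To time a whole block of code in a way that also works outside a notebook, the standard-library `timeit` module is one option. The sketch below is an illustration, not the notebook's own code: `school` is a hypothetical stand-in for `data['school']` built from the counts shown earlier, and `convert_with_loop` is a helper name introduced here.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for data['school'] (349 'GP' rows, 46 'MS' rows)
school = pd.Series(['GP'] * 349 + ['MS'] * 46)

def convert_school(value):
    if value == 'GP':
        return 0
    elif value == 'MS':
        return 1
    return None

def convert_with_loop():
    # Same explicit loop as in the cell above, wrapped in a function so it can be timed
    converted = []
    for value in school:
        converted.append(convert_school(value))
    return converted

# Run the loop 100 times and report the average time per run
seconds = timeit.timeit(convert_with_loop, number=100) / 100
print(f"loop: {seconds * 1e6:.1f} µs per run")
```

Averaging over many runs smooths out noise that a single measurement would pick up.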


In [28]:
print(converted_data)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [29]:
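%time
converted_data = data['school'].apply(convert_school)
# apply() takes a function and calls it on every value in the column, returning a new Series.
# It still calls convert_school once per value, but it is shorter than writing the loop by hand.
# Note: the bare `%time` line above times only itself; use `%%time` as the first line to time the cell.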

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 3.81 µs
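When the conversion is a simple lookup, another option is to pass a dictionary to `Series.map` instead of writing a function. This is a minimal sketch on a hypothetical five-row sample rather than the real `data['school']` column; `school_codes` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data['school']
school = pd.Series(['GP', 'MS', 'GP', 'GP', 'MS'])

# Each key is a category; each value is the number it should become
school_codes = {'GP': 0, 'MS': 1}
converted = school.map(school_codes)

print(converted.tolist())  # [0, 1, 0, 0, 1]
```

Any value that is missing from the dictionary becomes `NaN`, which plays the same role as the `return None` branch in `convert_school`.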



#### Using sklearn's built-in preprocessing module, we can do the same thing

In [12]:
enc = preprocessing.LabelEncoder()  # prebuilt sklearn class that maps each distinct label to an integer

In [13]:
transformed = enc.fit_transform(data['school'])  

In [14]:
transformed

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,
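To see which number stands for which label, a fitted `LabelEncoder` exposes `classes_`, and `inverse_transform` converts codes back to labels. The sketch below uses a hypothetical five-item sample instead of the full `school` column:

```python
from sklearn.preprocessing import LabelEncoder

schools = ['GP', 'MS', 'GP', 'GP', 'MS']  # hypothetical sample

enc = LabelEncoder()
codes = enc.fit_transform(schools)

print(enc.classes_)                   # labels in sorted order; a label's position is its code
print(codes)                          # the encoded values
print(enc.inverse_transform(codes))   # back to the original labels
```

Because the labels are sorted before being numbered, `'GP'` becomes 0 and `'MS'` becomes 1, which matches the hand-written `convert_school` function.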

#### Dealing with Null values

To show you how to deal with null values, I'm going to make some simulated data of students.

In [30]:
grades = np.random.choice(range(1, 13), 100)  # chooses 100 random integers from 1 to 12 (np is provided by %pylab inline)
num_friends_or_none = list(range(0, 20)) + [None] * 5
num_friends = np.random.choice(num_friends_or_none, 100)
new_data = pd.DataFrame(data={'Grade': grades, '# Friends': num_friends})

In [31]:
new_data.head(n=20)

Unnamed: 0,# Friends,Grade
0,9.0,3
1,19.0,12
2,0.0,11
3,0.0,4
4,16.0,7
5,6.0,7
6,12.0,11
7,17.0,10
8,3.0,1
9,19.0,7


#### One way to deal with null values is to drop them

In [17]:
new_data['# Friends'].dropna()  # returns a new Series with the null values removed; the original data is not changed

0      7
1     10
2     12
6     14
8      7
9     13
10    10
11    13
12    13
13     7
14    10
15    15
16    14
17     3
19    15
20     7
21     1
22     2
23     3
24    19
25    15
26     7
27     3
28    12
29     1
30    10
31     1
32     0
33    14
36     2
      ..
65     3
67     7
68    14
69    18
70    17
71    18
72    14
73    10
74     5
75    12
76     7
78     9
79    11
80    19
81     8
82    12
84    19
85    13
86     2
87     3
88    15
89     9
90    11
91     9
93     6
94     9
95    17
96    12
98     4
99    18
Name: # Friends, Length: 81, dtype: object

In [32]:
dropNas = pd.DataFrame()
dropNas['# Friends'] = new_data['# Friends'].dropna()
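Dropping nulls from a single column leaves the other columns behind. If the goal is to keep whole rows that have a value in a given column, `DataFrame.dropna` with the `subset` argument is one way to do it. A minimal sketch on a hypothetical four-row table (`students` and `complete` are names introduced here):

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Grade': [3, 12, 11, 4],
    '# Friends': [9, np.nan, 0, np.nan],
})

# Drop every row whose '# Friends' value is missing; the Grade column stays aligned
complete = students.dropna(subset=['# Friends'])

print(len(students), len(complete))
```

As with the Series version, `students` itself is left untouched and the result is a new DataFrame.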

In [18]:
average_friends = new_data['# Friends'].mean()  # mean() skips null values by default
new_data['# Friends'].fillna(average_friends)  # fillna() returns a copy with nulls replaced by a value of your choosing (here, the mean)

0      7.000000
1     10.000000
2     12.000000
3      9.777778
4      9.777778
5      9.777778
6     14.000000
7      9.777778
8      7.000000
9     13.000000
10    10.000000
11    13.000000
12    13.000000
13     7.000000
14    10.000000
15    15.000000
16    14.000000
17     3.000000
18     9.777778
19    15.000000
20     7.000000
21     1.000000
22     2.000000
23     3.000000
24    19.000000
25    15.000000
26     7.000000
27     3.000000
28    12.000000
29     1.000000
        ...    
70    17.000000
71    18.000000
72    14.000000
73    10.000000
74     5.000000
75    12.000000
76     7.000000
77     9.777778
78     9.000000
79    11.000000
80    19.000000
81     8.000000
82    12.000000
83     9.777778
84    19.000000
85    13.000000
86     2.000000
87     3.000000
88    15.000000
89     9.000000
90    11.000000
91     9.000000
92     9.777778
93     6.000000
94     9.000000
95    17.000000
96    12.000000
97     9.777778
98     4.000000
99    18.000000
Name: # Friends, Length:

In [19]:
new_data['# Friends'] = new_data['# Friends'].fillna(average_friends)
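Before filling, it can help to count how many values are missing and to make sure the column has a numeric dtype (the earlier output shows `dtype: object`, because the column was built from a mix of numbers and `None`). The following is a sketch on a hypothetical five-value Series, not the notebook's random data:

```python
import pandas as pd

# Hypothetical column that mixes numbers and None, so pandas stores it as object
friends = pd.Series([7, None, 12, None, 5], dtype=object)

numeric = pd.to_numeric(friends)         # float64, with None turned into NaN
n_missing = numeric.isna().sum()         # how many values are missing
filled = numeric.fillna(numeric.mean())  # replace the missing values with the mean

print(n_missing)
print(filled.tolist())
```

Here the mean of the three known values (7, 12, 5) is 8, so both missing entries become 8.0.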

#### Now let's learn how to standardize data
By that I mean transforming our data so that it has a mean of 0 and a standard deviation of 1.
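Concretely, each value `x` is replaced by `(x - mean) / std`. Here is a minimal NumPy sketch of that arithmetic on a hypothetical array; it uses the population standard deviation (`ddof=0`), which, to my understanding, is also what `StandardScaler` uses.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

mean = values.mean()   # 5.0
std = values.std()     # population standard deviation (ddof=0): 2.0
standardized = (values - mean) / std

print(standardized)
print(standardized.mean(), standardized.std())
```

After the transformation, the values have a mean of 0 and a standard deviation of 1, up to floating-point rounding.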

In [36]:
from sklearn.preprocessing import StandardScaler
# Standardizing centers the data at 0, which makes it easier to look at and compare,
# and it keeps a feature with a very large range from dominating the others

In [37]:
scaler = StandardScaler()
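scikit-learn scalers expect a 2D array with one row per sample and one column per feature, which is why the next cell reshapes the column with `reshape(-1, 1)`. A small sketch of what `-1` does, using a hypothetical six-element array:

```python
import numpy as np

flat = np.arange(6)            # shape (6,)
column = flat.reshape(-1, 1)   # -1 is inferred as 6 -> shape (6, 1)
pairs = flat.reshape(-1, 2)    # -1 is inferred as 3 -> shape (3, 2)

print(flat.shape, column.shape, pairs.shape)
```

NumPy fills in the `-1` dimension with whatever size makes the total number of elements match, so you don't need to know the row count in advance.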

In [39]:
scaler.fit_transform(dropNas['# Friends'].values.reshape(-1, 1))
# reshape(-1, 1) turns the values into a single column; -1 tells NumPy to infer
# the number of rows from the length of the data, so you don't have to know it



array([[-0.122034  ],
       [ 1.57403176],
       [-1.64849319],
       [-1.64849319],
       [ 1.06521203],
       [-0.63085373],
       [ 0.38678573],
       [ 1.23481861],
       [-1.13967346],
       [ 1.57403176],
       [ 1.40442519],
       [ 0.04757258],
       [-0.80046031],
       [-0.122034  ],
       [ 1.57403176],
       [ 0.38678573],
       [ 1.57403176],
       [ 1.57403176],
       [ 0.55639231],
       [-0.46124715],
       [ 1.23481861],
       [-1.30928003],
       [ 0.21717915],
       [ 0.72599888],
       [ 0.38678573],
       [ 0.55639231],
       [-1.13967346],
       [-1.47888661],
       [-0.122034  ],
       [-0.122034  ],
       [ 1.40442519],
       [-0.97006688],
       [-0.63085373],
       [ 1.40442519],
       [ 0.89560546],
       [-1.47888661],
       [ 0.38678573],
       [ 1.23481861],
       [ 1.40442519],
       [ 0.72599888],
       [-1.13967346],
       [ 0.21717915],
       [-0.46124715],
       [-0.97006688],
       [ 0.72599888],
       [-1