### Welcome to the jungle . . . we have pandas here.

Welcome to part II! Let's dig a bit into two crucial libraries that machine learning nerds like myself use every day to convert data into useful chunks of information for a neural network to learn from. Let's start by importing these libraries:

In [25]:
import pandas as pd
import numpy as np


So you're probably thinking at this point, 

"Whoa! Slow down, crazy person! What's up with the "as pd" and "as np" after the import statement?" 

The answer is programmers are lazy. By adding the "as <alias>" bit after the import statement, we give the library a short nickname, so we no longer have to write out its whole name whenever we steal a function from it. Let's talk a little about how that works with an example:


In [26]:
df_data = pd.read_table('./data-sets/adult.data.txt', sep=',', skipinitialspace=True)


What we did up here is (1) declare a new variable "df_data" to hold all the contents of the file we just loaded into Python. Pandas is great, but it's a little persnickety about knowing how the data is separated into cells in the document we're loading, so we (2) set the separator parameter, sep, to ',' to indicate that commas separate the values in each row. Cool. Lastly, taking a quick look at our data, you'll notice that there's a space after every comma, right before the next word or number. That space? Not very useful. So we edit it out by (3) setting the skipinitialspace parameter to True.
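To see what sep and skipinitialspace do without touching the real file, here's a minimal sketch that reads a tiny made-up sample from memory. The sample text and the names raw and clean are assumptions for illustration, not part of adult.data.txt.

```python
import io
import pandas as pd

# A made-up, two-row sample that mimics the "comma, then space" layout.
sample = "39, State-gov, Bachelors\n50, Self-emp, HS-grad\n"

# Without skipinitialspace, the space after each comma stays in the value.
raw = pd.read_table(io.StringIO(sample), sep=',', header=None)

# With skipinitialspace=True, pandas drops that leading space.
clean = pd.read_table(io.StringIO(sample), sep=',', header=None,
                      skipinitialspace=True)

print(repr(raw.iloc[0, 1]))    # the value still carries its leading space
print(repr(clean.iloc[0, 1]))  # the leading space is gone
```

Wrapping the values in repr() makes the stray space visible in the printout.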

Pandas usually needs at least one more parameter to capture all the data in your data table correctly--the "names" parameter, which takes a list of strings to be used as column headers--something we'll need later to look at the data in our new df_data variable. If your document already has headers, feel free to leave this parameter out--Pandas will infer what the names are by simply looking at the first row in your data sheet and saying "Ah! These are the headers!" But if you don't have headers, you'll need to (a) build a list of headers yourself and (b) tell pandas to use that list as headers, like so . . .


In [27]:
#There are a lot of columns to this sheet, and so it's good to at least
# put their names here for reference later on. We'll do the same with the
# Deep Neural Network we'll build in the next section.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "LABELS"
]

df_data = pd.read_table('./data-sets/adult.data.txt', sep=',', skipinitialspace=True, names=COLUMNS)



But! I already put the column names in the file for you, so there's no need to define them again. Still, let's hold onto these names for later--we'll need them to look up the data in individual columns of the glorified Excel sheet we're using for data . . . like right now! Let's say you want to see all the data in just one column of df_data, our pythonized version of that glorified Excel sheet. You can simply type the following code to get a printout of what's in that column.


In [29]:
df_data['age']

0        age
1         39
2         50
3         38
4         53
5         28
6         37
7         49
8         52
9         31
10        42
11        37
12        30
13        23
14        32
15        40
16        34
17        25
18        32
19        38
20        43
21        40
22        54
23        35
24        43
25        59
26        56
27        19
28        54
29        39
        ... 
32532     30
32533     34
32534     54
32535     37
32536     22
32537     34
32538     30
32539     38
32540     71
32541     45
32542     41
32543     72
32544     45
32545     31
32546     39
32547     37
32548     43
32549     65
32550     43
32551     43
32552     32
32553     43
32554     32
32555     53
32556     22
32557     27
32558     40
32559     58
32560     22
32561     52
Name: age, Length: 32562, dtype: object
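Notice that row 0 in the printout above holds the word "age" itself. That is what pandas does when a file already has a header row and you also pass names: the file's own header row gets read as ordinary data. Here's a minimal sketch of the difference, using a made-up in-memory sample (the sample text and the variable names are assumptions for illustration):

```python
import io
import pandas as pd

# A tiny stand-in for a file that already has a header row.
sample = "age,education\n39,Bachelors\n50,HS-grad\n"
cols = ["age", "education"]

# names alone: the file's header row is kept as the first row of data.
with_names = pd.read_csv(io.StringIO(sample), names=cols)

# names plus header=0: the file's first row is treated as a header
# and replaced by our list, so only real data rows remain.
with_header = pd.read_csv(io.StringIO(sample), names=cols, header=0)

print(len(with_names), with_names['age'].iloc[0])
print(len(with_header), with_header['age'].iloc[0])
```

With header=0 the age column holds only numbers, so pandas can store it as integers instead of text.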


Awesome! We now have a quick view of the values in that column. That's pretty useful! But what if we want to see the values in two columns at the same time? Pandas is a bit persnickety here: you can enter a single column name directly, but if you want to see multiple columns simultaneously, you'll need to put those column names in a list and then put that list between the square brackets following the variable "df_data", like so:

In [30]:
df_data[['age', 'education']]

   age  education
0  age  education
1   39  Bachelors
2   50  Bachelors
3   38    HS-grad
4   53       11th
5   28  Bachelors
6   37    Masters
7   49        9th
8   52    HS-grad
9   31    Masters
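The single-versus-double bracket business also changes what you get back. A minimal sketch, using a small made-up DataFrame (df_demo and its values are assumptions, not the real adult data):

```python
import pandas as pd

# A tiny stand-in for df_data.
df_demo = pd.DataFrame({'age': [39, 50],
                        'education': ['Bachelors', 'HS-grad']})

one_column = df_demo['age']                   # one name -> a Series
two_columns = df_demo[['age', 'education']]   # a list of names -> a DataFrame
still_a_frame = df_demo[['age']]              # a list with one name -> still a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(still_a_frame).__name__)
```

A Series is a single labelled column; a DataFrame is the whole sheet, even when it only has one column in it.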



Awesome! So now that we can SEE what's in two columns, what if we want to create a new, tiny glorified Excel sheet with only those two? How can we do that? And is there a way to save that data sheet for later? Have no fear--that's easy with pandas.


In [32]:
#To create a new dataframe using parts of an existing one, you can assign
# the values from the previous dataframe to a new one like this:
df_new = df_data[['age', 'education']]

#And you can save that datasheet to a new file like so:
# NOTE: I purposely put each parameter on a new line to illustrate what
# each parameter does. Usually, you can write all of this on one line.

df_new.to_csv('./data-sets/new-data.txt', # This is where you're going to put the new data!
              sep=',',                    # This is how that data will be separated into cells in the new document
              header=True,                # And as a fellow programmer, please always save the headers!
              index=False,                # But the index--the row number--is not always useful, plus pandas can figure that out on its own.
              encoding='utf-8'            # Lastly, always set the encoding for the characters. UTF-8 is a universal encoding, so I recommend using it, ALWAYS.
             )
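If you want to convince yourself that the save worked, one option is to read the file straight back in. Here's a minimal sketch that writes to a throwaway temporary folder instead of ./data-sets (df_demo, the file name demo.txt, and the temporary folder are assumptions for illustration):

```python
import os
import tempfile
import pandas as pd

# A tiny stand-in for df_new.
df_demo = pd.DataFrame({'age': [39, 50],
                        'education': ['Bachelors', 'HS-grad']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # Same parameters as above: headers saved, row numbers left out.
    df_demo.to_csv(path, sep=',', header=True, index=False, encoding='utf-8')

    # Read it back in while the temporary folder still exists.
    df_back = pd.read_csv(path, sep=',', encoding='utf-8')

print(list(df_back.columns))
print(df_back.shape)
```

Because index=False left the row numbers out of the file, the reloaded sheet has exactly the two columns we saved.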


Great! Now we can not only load in our own data, but also create new data-sheets from it and save those sheets somewhere on our computer. You're on your way! Let's talk a little bit, now, about what to do with our data besides staring googly-eyed at it. Let's (1) convert all the values in a column to a list, and (2) build a dictionary out of that list. The next bit of code will transform a column in pandas into a list, and then append the values from another column to that list.


In [33]:
#To convert a column to a list, we can do something like this:
new_list = df_data['education'].values.tolist()

#And let's say we want to (eventually) make a dictionary of all the values
# in this list plus the values from another column full of useful
# vocabulary. First, we add that second column onto the list like this:
new_list += df_data['native_country'].values.tolist()


It's important to note two things. First, you can append a bunch of items to a list by adding two lists together. This can save you a lot of time later. Second, the "+=" above keeps whatever was already in the list and then adds the new list's items onto the end. If you forget the "+" though, you'll overwrite the old list and lose all of its values.
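Here's a minimal sketch of that difference with two short made-up lists (the list names and values are assumptions for illustration):

```python
education = ['Bachelors', 'HS-grad']
countries = ['United-States', 'Cuba']

# "+=" keeps the old items and tacks the new ones onto the end.
combined = list(education)   # copy, so the original list stays untouched
combined += countries

# Plain "=" throws the old items away and points at the new list instead.
overwritten = list(education)
overwritten = countries

print(combined)
print(overwritten)
```

The first printout has all four items; the second has only the two countries.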

Now, let's take those values and convert them into a dictionary that we can use to numerically represent all the words we just collected. There are a couple of great ways to cheat in this process, but I want to show you how to do this using a "for loop" and a simple counter--anything more complicated than this is, honestly, probably just a headache.


In [34]:
#First, let's make a counter and start it at 0.
counter=0

#Now, let's make an empty dictionary, like what we talked about in the last
# section.
new_dictionary={}


#Now, let's create a for loop that (1) adds new entries into the dictionary
# where the values are a number--which we generate using the counter. That's
# a lot of verbiage to say let's make a bunch of dictionary entries that
# look like {'word': 1}
for item in set(new_list):
    new_dictionary[item]=counter
    #And by doing +=1 to the counter, every time we add an example, the 
    # counter will go up by one--genius!
    counter+=1


Awesome! A couple of quick notes: the function set() is built into Python. It takes a list and gets rid of all the repeated items in that list. So a set made from the list 

[1,2,3,1,2] 

would look like 

{1,2,3}.

(Sets are written with curly braces, and they don't promise to keep their items in any particular order.)

There are going to be A LOT of repeated entries in the list we made for education and native_country, and we honestly don't want all the repetition. If an item already exists in a dictionary and we do new_dictionary[item]=counter, the key--that first item--will still exist, but the value the dictionary had before--the second item--will be replaced with the new number from counter. Since the counter keeps climbing either way, the replaced numbers never get used again, which leaves gaps in your numbering later on. So let's avoid this at all costs.
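To make those gaps concrete, here's a minimal sketch with a short made-up list (the list and the dictionary names are assumptions). The sorted() call is an extra step that isn't in the lesson's code; it's only there so the numbering comes out the same on every run.

```python
words = ['HS-grad', 'Bachelors', 'HS-grad', 'Masters']

# Looping over the raw list: the repeated 'HS-grad' overwrites its own
# entry, so the first number it was given (0) is never used.
gappy = {}
counter = 0
for item in words:
    gappy[item] = counter
    counter += 1

# Looping over the unique items: every number from 0 up is used once.
tidy = {}
counter = 0
for item in sorted(set(words)):
    tidy[item] = counter
    counter += 1

print(sorted(gappy.values()))
print(sorted(tidy.values()))
```

The first dictionary ends up with the numbers 1, 2, 3 (0 went missing), while the second gets a clean 0, 1, 2.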

Awesome! We now have a dictionary for all the items in both education and native_country. But what if you want multiple dictionaries? What then? Well, in that case, what we ought to do is write a function that we can call, one that RETURNS a new dictionary whenever we'd like. We can do that with a bit of code like this, which has one input--some list that we'll refer to as listin--and RETURNS a dictionary made from that list:


In [35]:
def create_dictionary(listin):
    #Let's create an empty dictionary up top that we can save 
    # everything to as we go through the list, like we did in
    # our for loop . . .
    dictionary={}
    
    #And we'll add a counter, too:
    counter=0
    
    #and recreate our for loop, but this time use "listin" as
    # a stand-in for ANY list we want to create a dictionary with.
    for item in set(listin):
        dictionary[item]=counter
        counter+=1
        
    #Lastly, we'll return the dictionary so that we can use it outside
    # of the function.
    return dictionary
    


An important thing to note about the function above: everything that is indented is INSIDE of the function. It's part of it. That means that unless we RETURN the dictionary we built inside the function, we'll never get to see or use it in any other bits of code, which would be a bummer. But by creating a function that returns something to us, we can use the same function over and over again for all kinds of other purposes, which saves us A LOT of typing. And that's worth it. So let's use this function on our last list of words, and then try it on a new list of words, too!


In [37]:
dictionary_2 = create_dictionary(new_list)

#And let's make a new list from our df_data data-sheet from the marital
# status column, and name this new list . . . well, marital_status?
marital_status = df_data['marital_status'].values.tolist()

#And then use this new list to create ANOTHER dictionary.
marital_dictionary = create_dictionary(marital_status)


And we can check on these two dictionaries--and how they differ--by
looking at their length. Length is often a good, quick metric to use
to check the difference between lists, dictionaries, and tuples, and
it will be a useful way to avoid manually entering a bunch of numbers
every time we build a new Neural Network in the following lesson.


In [41]:
print( len(dictionary_2) )
print( len(marital_dictionary) )

60
8
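One way a dictionary like this might be put to work is to swap every word in a column for its number. Here's a minimal sketch using the pandas Series.map method on a small made-up column (df_demo, lookup, and encoded are assumed names, and this is just one possible approach, not necessarily what the next lesson does):

```python
import pandas as pd

def create_dictionary(listin):
    # Same function as above: one number per unique item.
    dictionary = {}
    counter = 0
    for item in set(listin):
        dictionary[item] = counter
        counter += 1
    return dictionary

# A tiny stand-in for the marital_status column.
df_demo = pd.DataFrame(
    {'marital_status': ['Divorced', 'Never-married', 'Divorced']}
)
lookup = create_dictionary(df_demo['marital_status'].values.tolist())

# Series.map looks each word up in the dictionary and returns its number.
encoded = df_demo['marital_status'].map(lookup)

print(len(lookup))
print(encoded.tolist())
```

Which word gets 0 and which gets 1 can change from run to run, because a set has no fixed order, but the two 'Divorced' rows will always share the same number.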



Wham! And that's a lot of pandas! Now, there's a lot more you can do with pandas, and if you want to learn more about that you can check out the developers' website here: https://pandas.pydata.org/pandas-docs/stable/api.html. But for right now, I want to briefly cover numpy and how we'll use it in the next lesson, and finish up with how to use numpy and pandas together to create a new datasheet from scratch.

Numpy is a tool that's used to create structured matrices/arrays of items . . . think of an array as a list that has a set organization--it's split into columns and rows, as opposed to being just a long snake of items. Let's say, then, that we have a list with four numbers. We can use numpy to transform this list into arrays with different shapes like so:

NOTE: the '\n' at the end of each print statement simply prints a blank line underneath the array . . . if you want to put multiple items in a print statement, you can--you just need to separate them with a comma.


In [44]:
#(1) First, let's make a list with some numbers in it.
listy = [1,2,3,4]

#(2) Second, let's make a numpy array without reshaping it, just to see 
# what happens. It comes out flat (one-dimensional), so we'll name it
# array_blob, because that's what it is--a blob.
array_blob = np.array(listy)
print( array_blob, '\n')

#(3) Next, let's take listy, and RESHAPE it with numpy to have 1 row, and
# 4 columns.
array_1x4 = np.array(listy).reshape(1, 4) #1= the number of rows, 4= the number of columns
print( array_1x4, '\n')

#(4) Now, let's make an array out of listy with 1 column and 4 rows.
array_4x1 = np.array(listy).reshape(4,1)
print( array_4x1, '\n')

#(5) Lastly, let's make an array out of listy with 2 columns, and 2 rows.
array_2x2 = np.array(listy).reshape(2,2)
print( array_2x2 )


[1 2 3 4] 

[[1 2 3 4]] 

[[1]
 [2]
 [3]
 [4]] 

[[1 2]
 [3 4]]
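If you ever lose track of how an array is laid out, every numpy array carries a .shape you can print. Here's a minimal sketch; array_auto and the -1 shortcut (which asks numpy to work out that dimension for you) are extras that aren't used in the code above:

```python
import numpy as np

listy = [1, 2, 3, 4]

array_blob = np.array(listy)                 # flat: one dimension
array_4x1 = np.array(listy).reshape(4, 1)    # 4 rows, 1 column
array_auto = np.array(listy).reshape(-1, 2)  # "however many rows fit", 2 columns

print(array_blob.shape)
print(array_4x1.shape)
print(array_auto.shape)
```

The shape is always written as (rows, columns), with the flat array showing just a single number for its length.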



Cool! Let's take array_2x2 now, and create a new pandas datasheet--or DataFrame--out of it. To do this, we can use a snippet of code like this:


In [47]:
df_2x2 = pd.DataFrame(array_2x2,          #The array that we want to use to create the new pandas DataFrame
                     columns=['A', 'B']   #What in G-d's green thumb we're going to call the new columns in the DataFrame
                     )

df_2x2

   A  B
0  1  2
1  3  4
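The trip works in the other direction, too. Here's a minimal sketch that rebuilds the same DataFrame and then pulls the numbers back out with .values, the same attribute we used earlier to turn a column into a list (back_to_array is an assumed name for illustration):

```python
import numpy as np
import pandas as pd

array_2x2 = np.array([1, 2, 3, 4]).reshape(2, 2)
df_2x2 = pd.DataFrame(array_2x2, columns=['A', 'B'])

# One column, as a plain Python list.
print(df_2x2['A'].values.tolist())

# The whole DataFrame, back as a numpy array.
back_to_array = df_2x2.values
print(np.array_equal(back_to_array, array_2x2))
```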



And of course, if we want to save it to a new file, we can do that using the .to_csv function from pandas.


In [48]:
df_2x2.to_csv('./data-sets/2x2.txt', 
              header=True, 
              index=False, 
              encoding='utf-8'
             )


And voila! Your first lesson in pandas and numpy is complete! Again, the two libraries can do a heck of a lot more than what we just covered, and for more information about them I highly recommend you visit the developers' websites for both. You can find them at the following links:

Pandas:
https://pandas.pydata.org/pandas-docs/stable/api.html

Numpy:
https://docs.scipy.org/doc/numpy-1.13.0/reference/

Think of these reference guides as cookbooks that you can use to brew all kinds of new goodies in code! Next up, we'll talk about how to take a file from your computer and load it into a fully functional Deep Neural Network (or DNN, as it's called) to build your first Deep Learning Artificial Intelligence.