# Python Basics

## When to use Python?

Python is a versatile language. It is useful, for instance, for the following tasks:
* Doing quick calculations
* Developing a database-driven website
* Cleaning and analyzing the results of a satisfaction survey

## Comments

We can add **comments** to our Python scripts. Comments are important to make sure that you and others can understand what your code is about.

To add comments to our Python script, we can use the `#` character. Everything after `#` on a line is ignored by Python, so comments will not influence our result.

In [1]:
# Just testing addition
print("7 + 10 = {}".format(7 + 10))

# Trying subtraction
print("18 - 15 = {}".format(18 - 15))

# Multiplication works
print("13 * 5 = {}".format(13 * 5))

# Division works too
print("5 / 8 = {}".format(5 / 8))

7 + 10 = 17
18 - 15 = 3
13 * 5 = 65
5 / 8 = 0.625


## Python as a calculator

Python is perfectly suited to do basic calculations. Apart from addition, subtraction, multiplication and division, there is also support for more advanced operations such as:
* Exponentiation: `**`. This operator raises the number to its left to the power of the number to its right. For example `4 ** 2` will give `16`.
* Modulo: `%`. This operator returns the remainder of the division of the number to the left by the number on its right. For example `18 % 7` equals `4`.

In [2]:
# Exponentiation
print("4 ** 2 = {}".format(4 ** 2))

# Modulo
print("18 % 7 = {}".format(18 % 7))

4 ** 2 = 16
18 % 7 = 4


Suppose we have $100, which we can invest with a 10% return each year.
* After one year, it's 100 x 1.1 = 110 dollars.
* After two years it's 100 x 1.1 x 1.1 = 121 dollars.

We can calculate how much money we end up with after seven years as follows: 

In [3]:
# Amount after seven years
amount = round(100 * (1.1 ** 7), 2)
print(str(amount) + " dollars")

194.87 dollars
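
As a rough check of the arithmetic above, the amount after each year can be computed one year at a time. This is only a sketch: the names `principal`, `growth` and `yearly_amounts` are illustrative and are not used elsewhere in this notebook.

```python
# Illustrative names, not used elsewhere in this notebook
principal = 100   # starting amount in dollars
growth = 1.1      # growth factor for a 10% return per year

# Amount at the end of each of the first seven years, rounded to cents
yearly_amounts = [round(principal * growth ** year, 2) for year in range(1, 8)]

for year, value in enumerate(yearly_amounts, start=1):
    print("After year {}: {} dollars".format(year, value))
```

The first two values match the 110 and 121 dollars worked out by hand, and the last one matches the result of the cell above.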


## Variable assignment

In Python, a variable allows us to refer to a value with a name. To create a variable, we use `=`, like this example:

`x = 5`

We can now use the name of this variable, `x`, instead of the actual value, `5`.

**Remember:** `=` in Python means *assignment*, not equality!

In [4]:
# Create a variable savings and assign 100 to it
savings = 100

# Print out savings
print(savings)

100
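
The reminder that `=` means assignment rather than equality can be illustrated with a small sketch. The comparison operator `==` is not covered elsewhere in this section, and the name `is_five` is only illustrative.

```python
# Assignment: bind the name x to the value 5
x = 5

# Equality test: == compares two values and gives a boolean
is_five = (x == 5)
print(is_five)   # True

# Assignment evaluates the right-hand side first and then rebinds the name,
# so this is valid code even though it is not a valid equation
x = x + 1
print(x)         # 6
```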


## Calculations with variables

Instead of calculating with literal values, we can use variables.

In [5]:
# Create a variable factor and assign 1.1 to it
factor = 1.1

# Calculate how much money we end up with after five years
result = round(savings * (factor ** 5), 2)
print(result)

161.05


## Some variable types

* `int`, or integer: a number without a fractional part. `savings`, with the value of `100`, is an example of an integer.
* `float`, or floating point: a number that has both an integer and a fractional part, separated by a point. `factor`, with the value of `1.1`, is an example of a float.
* `str`, or string: a type to represent text. We can use single or double quotes to build a string.
* `bool`, or boolean: a type to represent logical values. Can only be `True` or `False` (the capitalization is important).

In [6]:
# An example of an integer
integer_example = 10
print("An example of an integer: {}".format(integer_example))

# An example of a float
float_example = 1.25
print("An example of a float: {}".format(float_example))

# An example of a string
string_example = 'This is a string!'
print("An example of a string: {}".format(string_example))

# An example of a boolean
boolean_example = True
print("An example of a boolean: {}".format(boolean_example))

An example of an integer: 10
An example of a float: 1.25
An example of a string: This is a string!
An example of a boolean: True


To find out the type of a value or a variable that refers to that value, we can use the `type()` function. Suppose we have defined a variable `a`, but we forgot the type of this variable. To determine the type of `a`, we can simply execute `type(a)`.

In [7]:
print(type(integer_example))
print(type(float_example))
print(type(string_example))
print(type(boolean_example))

<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>


## Operations with different types

Different types behave differently in Python. When we sum two strings, for example, we'll get different behavior than when we sum two integers or two booleans.

In [8]:
# Variables to work with
savings = 100
factor = 1.1
desc = "compound interest"

# Assign product of factor and savings to year1 and print its type
year1 = factor * savings
print(type(year1))

# Assign sum of desc and desc to doubledesc and print it
doubledesc = desc + desc
print(doubledesc)

<class 'float'>
compound interestcompound interest
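
To make the difference between types more concrete, here is a small sketch that applies `+` to integers, strings, booleans and a mix of `int` and `float`. The variable names are illustrative only.

```python
# The same + operator behaves differently depending on the operand types
int_sum = 2 + 3            # integers are added
str_sum = "ab" + "cd"      # strings are concatenated
bool_sum = True + True     # booleans behave like the integers 1 and 0
mixed_sum = 1 + 1.5        # an int and a float give a float

print(int_sum)                     # 5
print(str_sum)                     # abcd
print(bool_sum, type(bool_sum))    # 2 <class 'int'>
print(mixed_sum, type(mixed_sum))  # 2.5 <class 'float'>
```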


## Type conversion

Using the `+` operator to paste together two strings can be very useful in building custom messages.

Suppose, for example, that we've calculated the return of our investment and want to summarize the results in a string. Assuming the integer `savings` and the float `result` are defined, we can try:

In [9]:
print("I started with $" + savings + " and now have $" + result + ". Awesome!")

TypeError: must be str, not int

This does not work, though, because we cannot simply add strings and numbers together.

To fix the error, we'll need to explicitly convert the types of our variables. More specifically, we'll need `str()` to convert a value to a string. `str(savings)`, for example, will convert the integer `savings` to a string.

Similar functions such as `int()`, `float()` and `bool()` will help you convert Python values to other types.

In [10]:
# Fix the printout
print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")

I started with $100 and now have $161.05. Awesome!
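
The other conversion functions mentioned above work along the same lines. The sketch below uses made-up values and illustrative variable names to show a few typical conversions.

```python
# Converting between types with the built-in conversion functions
from_text_int = int("42") + 1       # string -> int, then add
from_text_float = float("3.5") * 2  # string -> float, then multiply
truncated = int(3.99)               # float -> int drops the fractional part
zero_as_bool = bool(0)              # 0 converts to False
seven_as_bool = bool(7)             # any non-zero number converts to True
empty_as_bool = bool("")            # an empty string converts to False

print(from_text_int)                 # 43
print(from_text_float)               # 7.0
print(truncated)                     # 3
print(zero_as_bool, seven_as_bool)   # False True
print(empty_as_bool)                 # False
```

Note that not every conversion is possible: `int("3.5")`, for example, raises a `ValueError` because the text does not represent an integer.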


# Python Lists 

## Create a list

As opposed to `int`, `bool` etc., a list is a **compound data type**; you can group values together.

After measuring the height of your family, you decide to collect some information on the house you're living in. The areas of the different parts of the house are stored in separate variables for now, as shown below:

In [11]:
# Area variables in square meters
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Create list areas
areas = [hall, kit, liv, bed, bath]

# Print areas
print(areas)

[11.25, 18.0, 20.0, 10.75, 9.5]


## Create list with different types

A list can contain any Python type. Although it's not really common, a list can also contain a mix of Python types including strings, floats, booleans, etc.

The printout of the previous code cell wasn't really satisfying. It's just a list of numbers representing the areas, but we can't tell which area corresponds to which part of the house. 

The code below is the start of a solution:

In [12]:
# The beginning of a solution
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]

# Print areas
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5]


## List of lists

As a data scientist, you'll often be dealing with a lot of data, and it will make sense to group some of this data.

Instead of creating a flat list containing strings and floats, representing the names and areas of the rooms in the house, we can create a list of lists. The code below can already give us an idea:

In [13]:
# House information as a list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

# Print house
print(house)

[['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]


## Subset and conquer

Subsetting Python lists is a piece of cake. The code below, for example, creates a list `x` and then selects `"b"` from it. Notice that this is the second element, so it has index 1. We can also use negative indexing, which counts from the end of the list.

In [14]:
# Create list x
x = ["a", "b", "c", "d"]

# Print second element of x
print(x[1])

# Print second element of x using negative indexing
print(x[-3])

b
b


Using the `areas` list that contains both strings and floats, we'll do some subsetting in the code cell below:

In [15]:
# Definition of areas
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]

# Print second element from areas
print(areas[1])

# Print last element from areas using negative indexing
print(areas[-1])

# Print the area of the living room
print(areas[5])

11.25
9.5
20.0


## Subset and calculate

After we've extracted values from a list, we can use them to perform additional calculations.

The code cell below extracts the second and the fourth elements of `x`. The strings that result are pasted together using the `+` operator.

In [16]:
# Print the sum of the second and the fourth element of x
print(x[1] + x[3])

bd


Now, using a combination of list subsetting and variable assignment, we'll create a new variable that contains the sum of the areas of the kitchen and the bedroom:

In [17]:
# Create eat_sleep_area variable and print it
eat_sleep_area = areas[3] + areas[7]
print(eat_sleep_area)

28.75


## Slicing and dicing

Selecting single values from a list is just one part of the story. It's also possible to *slice* a list, which means selecting multiple elements of it, using the following syntax: `my_list[start:end]`. The `start` index will be included, while the `end` index will *not*.

The code cell below shows an example. A list with `"b"` and `"c"`, corresponding to indexes 1 and 2, is selected from `x`. Notice that the elements with index 1 and 2 are included, while the element with index 3 is not.

In [18]:
print(x[1:3])

['b', 'c']


It's also possible not to specify where to begin and end the slice of the list. If we don't specify the `start` index, Python figures out that we want to start our slice at the beginning of the list. If we don't specify the `end` index, the slice will go all the way to the last element of the list.

Below, we'll use slicing to create both the `downstairs` and `upstairs` lists from the `areas` list:

In [19]:
# Create and print downstairs, that contains the first 6 elements of areas
downstairs = areas[:6]
print(downstairs)

# Create and print upstairs, that contains the last 4 elements of areas
upstairs = areas[6:]
print(upstairs)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0]
['bedroom', 10.75, 'bathroom', 9.5]
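
The slicing rules described above can be summarized in a short sketch. The list `letters` and the other names are made up for illustration.

```python
letters = ["a", "b", "c", "d", "e"]

middle = letters[1:4]     # indexes 1, 2 and 3; index 4 is excluded
first_two = letters[:2]   # no start index: begin at the first element
from_three = letters[3:]  # no end index: run to the last element
last_two = letters[-2:]   # negative start index: count from the end

print(middle)      # ['b', 'c', 'd']
print(first_two)   # ['a', 'b']
print(from_three)  # ['d', 'e']
print(last_two)    # ['d', 'e']

# When both indexes are within range, my_list[start:end] has end - start elements
print(len(middle))  # 3
```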


## Subsetting lists of lists

We saw before that a Python list can contain practically anything, even other lists. To subset lists of lists, we can use the same technique as before: square brackets.

In [20]:
# Define x
x = [["a", "b", "c"],
     ["d", "e", "f"],
     ["g", "h", "i"]]

# Examples of subsetting x
print(x[2][0])
print(x[2][:2])

g
['g', 'h']


Using the house list of lists again, `house[-1][1]` will return the bathroom area:

In [21]:
# Define house as a list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

# Subset house
print(house[-1][1])

9.5
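
It can help to read a double subscript such as `house[-1][1]` as two separate steps. The sketch below uses a shortened, made-up version of the house list, and the name `last_room` is illustrative.

```python
# Shortened, illustrative version of the house list
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["bathroom", 9.5]]

# Step 1: the first subscript selects an inner list
last_room = house[-1]
print(last_room)        # ['bathroom', 9.5]

# Step 2: the second subscript selects an element of that inner list
print(last_room[1])     # 9.5

# Both steps written in one expression give the same value
print(house[-1][1])     # 9.5
```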


## Replace list elements

Replacing list elements is pretty easy. Simply subset the list and assign new values to the subset. We can select single elements or we can change entire list slices at once:

In [22]:
# Create and print x
x = ["a", "b", "c", "d"]
print(x)

# Replace some elements and print x
x[1] = "r"
x[2:] = ["s", "t"]
print(x)

['a', 'b', 'c', 'd']
['a', 'r', 's', 't']


Below, we will replace elements in the `areas` list.

In [23]:
# Create areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Replace bathroom area
areas[-1] = 10.50

# Replace living room description
areas[4] = "chill zone"

# Print areas
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5]


## Extend a list

We can use the `+` operator to add elements to a list.

In [24]:
# Create x
x = ["a", "b", "c", "d"]

# Extend x
x = x + ["e", "f"]

# Print x
print(x)

['a', 'b', 'c', 'd', 'e', 'f']


Below, we will add poolhouse and garage to the areas list:

In [25]:
# Print areas
print(areas)

# Extend areas list
areas = areas + ["poolhouse", 24.5, "garage", 15.45]

# Print areas (extended)
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5]
['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5, 'poolhouse', 24.5, 'garage', 15.45]


## Delete list elements

We can use the `del` statement to remove elements from a list. Notice that as soon as we remove an element from a list, the indexes of the elements that come after the deleted element all change.

In [26]:
# Create x
x = ["a", "b", "c", "d"]

# Remove the second element of x
del x[1]

# Print x
print(x)

['a', 'c', 'd']


Suppose we don't have a poolhouse. We can delete it from our areas list as follows:

In [27]:
# Print areas
print(areas)

# Remove poolhouse from the list
del areas[-4]   # removes the string "poolhouse"
del areas[-3]   # removes the poolhouse area, 24.5

# Print areas (after deleting poolhouse and its area)
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5, 'poolhouse', 24.5, 'garage', 15.45]
['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5, 'garage', 15.45]
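
Because indexes can shift after each deletion, it is sometimes easier to remove several neighboring elements in one statement. The sketch below deletes a slice from a shortened, made-up `areas` list.

```python
# Shortened, illustrative version of the areas list
areas = ["bathroom", 10.5, "poolhouse", 24.5, "garage", 15.45]

# Delete the poolhouse name and its area at once by deleting a slice
del areas[-4:-2]

print(areas)   # ['bathroom', 10.5, 'garage', 15.45]
```

Since both elements are removed in a single statement, there is no need to work out how the indexes change between two separate deletions.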


## Inner workings of lists

The Python code below creates a list with the name `areas` and a copy named `areas_copy`. After that, the first element in the `areas_copy` is changed and the `areas` list is printed out. Notice that, although we've changed `areas_copy`, the change also takes effect in the `areas` list. That's because `areas` and `areas_copy` point to the same list.

In [28]:
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Create areas_copy
areas_copy = areas

# Change areas_copy
areas_copy[0] = 5.0

# Print areas
print(areas)

[5.0, 18.0, 20.0, 10.75, 9.5]


To prevent changes in `areas_copy` from also taking effect in `areas`, we'll have to make a more explicit copy of the `areas` list. We can do this with `list()` or by slicing with `[:]`.

In [29]:
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Create areas_copy
areas_copy = list(areas)

# Change areas_copy
areas_copy[0] = 5.0

# Print areas
print(areas)

[11.25, 18.0, 20.0, 10.75, 9.5]
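
The difference between a second name for the same list and a real copy can be checked with the `is` operator, which tests whether two names refer to the same object. This is a sketch with illustrative names (`alias`, `copy_a`, `copy_b`); the `is` operator is not covered elsewhere in this section.

```python
areas = [11.25, 18.0, 20.0]

alias = areas          # a second name for the same list
copy_a = list(areas)   # a new list with the same elements
copy_b = areas[:]      # a new list created with a full slice

print(alias is areas)    # True: same object
print(copy_a is areas)   # False: different object
print(copy_b is areas)   # False: different object
print(copy_a == areas)   # True: equal contents

# Changing the list through the alias affects areas but not the copies
alias[0] = 5.0
print(areas)    # [5.0, 18.0, 20.0]
print(copy_a)   # [11.25, 18.0, 20.0]
print(copy_b)   # [11.25, 18.0, 20.0]
```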


# Functions and Packages

## Familiar functions

Out of the box, Python offers a bunch of built-in functions to make a data scientist's life easier. We already know two such functions: `print()` and `type()`. We've also used `str()`, `int()`, `bool()` and `float()` to switch between data types. These are built-in functions as well.

Calling a function is easy. To get the type of `3.0` and store the output as a new variable, `result`, we can use the following: `result = type(3.0)`.

The general recipe for calling functions is thus: `output = function_name(input)`.

In [30]:
# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True

# Print out type of var1
print(type(var1))

# Print out length of var1
print(len(var1))

# Convert var2 to an integer and assign it to out2
out2 = int(var2)

# Print out out2
print(out2)

<class 'list'>
4
1


## Help!

Maybe we already know the name of a Python function, but we still have to figure out how to use it. Ironically, we have to ask for information about a function with another function: `help()`.

To get help on the `complex()` function, for example, we can use `help(complex)`.

In [31]:
# Open up the documentation on max()
# help() prints the documentation itself, so no print() call is needed
help(max)

Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.


## Multiple arguments

Below, we'll have a look at the documentation of `sorted()`:

In [32]:
# Open the documentation on sorted()
# help() prints the documentation itself, so no print() call is needed
help(sorted)

Help on built-in function sorted in module builtins:

sorted(iterable, /, *, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.


As we can see, `sorted()` takes three arguments: `iterable`, `key` and `reverse`.

`key=None` means that if we don't specify the `key` argument, it will be `None`. `reverse=False` means that if we don't specify the `reverse` argument, it will be `False`.

Below, we'll create two lists, paste them together and sort the result in descending order. For that, we'll only have to specify `iterable` and `reverse`. The first input we pass to `sorted()` will be matched to the `iterable` argument, and to tell Python we want to specify `reverse` without changing anything about `key`, we'll use `=`. Notice that, for now, we can understand an *iterable* as being any collection of objects, e.g. a list.

In [33]:
# Create lists 
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

# Paste lists together and assign it to full
full = first + second

# Sort full in descending order and assign it to full_sorted
full_sorted = sorted(full, reverse = True)

# Print out full_sorted
print(full_sorted)

[20.0, 18.0, 11.25, 10.75, 9.5]
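
The `key` argument was left at its default above. As a sketch of how it can be used, the example below sorts some made-up room names by their length by passing the built-in `len` function as the key. The names `rooms`, `by_length` and `longest_first` are illustrative.

```python
rooms = ["kitchen", "hall", "living room", "bed"]

# Sort by the length of each string instead of alphabetically
by_length = sorted(rooms, key=len)
print(by_length)       # ['bed', 'hall', 'kitchen', 'living room']

# key and reverse can be combined
longest_first = sorted(rooms, key=len, reverse=True)
print(longest_first)   # ['living room', 'kitchen', 'hall', 'bed']

# sorted() returns a new list; the original list is unchanged
print(rooms)           # ['kitchen', 'hall', 'living room', 'bed']
```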


## Methods

We can think of methods as functions that belong to objects.

## String methods

Strings come with a bunch of methods. Below, we'll explore some of them.

In [34]:
# Create a string and assign it to room
room = "poolhouse"

# Use upper() method on room and assign it to room_up
room_up = room.upper()

# Use count() method on room, with the letter "o" as input, and assign it to room_count
room_count = room.count("o")

# Print out room, room_up and room_count
print(room)
print(room_up)
print(room_count)

poolhouse
POOLHOUSE
3
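
One detail worth noticing is that string methods return a new value and leave the original string untouched. The sketch below shows this with a few more methods; the variable names other than `room` are illustrative.

```python
room = "poolhouse"

room_cap = room.capitalize()              # first letter in upper case
room_new = room.replace("pool", "guest")  # replace part of the text
h_position = room.index("h")              # index of the first "h"

print(room_cap)     # Poolhouse
print(room_new)     # guesthouse
print(h_position)   # 4

# The original string has not been changed by any of these calls
print(room)         # poolhouse
```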


## List methods

Strings are not the only Python type that has methods associated with it. Lists, floats, integers and booleans are also types that come packaged with a bunch of useful methods. Below, we'll use:
* `index()` to get the index of the first element of `areas` that matches its input; and 
* `count()` to get the number of times `14.5` appears in `areas`.

In [35]:
# Create areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Print out the index of the element 20.0
print(areas.index(20.0))

# Print out how often 14.5 appears in areas
print(areas.count(14.5))

2
0


Most list methods change the list they're called on. Examples are:
* `append()`, which adds an element to the list it's called on;
* `remove()`, which removes the first element of a list that matches the input; and
* `reverse()`, which reverses the order of the elements in the list it is called on.

Below, we'll see how these three methods work in practice.

In [36]:
# Add two new sizes to the areas list
areas.append(24.5)
areas.append(15.45)
print(areas)

# Remove 9.50 from areas
areas.remove(9.50)
print(areas)

# Reverse the order of the elements in areas
areas.reverse()
print(areas)

[11.25, 18.0, 20.0, 10.75, 9.5, 24.5, 15.45]
[11.25, 18.0, 20.0, 10.75, 24.5, 15.45]
[15.45, 24.5, 10.75, 20.0, 18.0, 11.25]
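
Since these methods change the list in place, they return `None` rather than a new list. The sketch below illustrates this with `append()` and with the `sort()` method, which is not used elsewhere in this section; the name `returned` is illustrative.

```python
areas = [11.25, 18.0, 20.0, 10.75]

# append() changes the list in place and returns None
returned = areas.append(9.5)
print(returned)   # None
print(areas)      # [11.25, 18.0, 20.0, 10.75, 9.5]

# sort() also works in place, unlike the sorted() function seen earlier
areas.sort()
print(areas)      # [9.5, 10.75, 11.25, 18.0, 20.0]
```

A common mistake is to write `areas = areas.append(9.5)`, which replaces the list with `None`.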


## Packages

We can think of a package as a directory of Python scripts. Each script is called a *module* and specifies functions, methods and types. `NumPy`, `Matplotlib` and `Scikit-learn` are examples of packages.

## Import packages

For a data scientist, some notion of geometry never hurts. In the code cell below, we'll refresh some of the basics.

For a fancy clustering algorithm, suppose we want to find the circumference `C` and the area `A` of a circle. When the radius of the circle is `r`, we can calculate `C` and `A` as:
* C = 2 x pi x r
* A = pi x (r ^ 2)

To use the constant `pi`, we'll need the `math` package, which we will import using `import math`.

In [37]:
# Import math package
import math

# Define the radius
r = 0.43

# Calculate the circumference and the area
C = 2 * math.pi * r
A = math.pi * (r ** 2)

# Print out the results
print("Circumference: " + str(C))
print("Area: " + str(A))

Circumference: 2.701769682087222
Area: 0.5808804816487527


## Selective import

General imports, like `import math`, make **all** functionality from the `math` package available. However, if we decide to only use a specific part of a package, we can make our import more selective: `from math import pi`.

Below, we'll suppose that the Moon's orbit around the Earth is a perfect circle and calculate its travel distance over 12 degrees. For that, we'll import the `radians` function from the `math` package.

In [38]:
# Import the radians function
from math import radians

# Define the radius
r = 192500

# Calculate Moon's travel distance over 12 degrees and assign it to distance
distance = r * radians(12)

# Print out the result
print(distance)

40317.10572106901


## Different ways of importing

In order to use a function, we can import it in several ways.

In [39]:
# Import the whole package
import scipy

# Import the module
from scipy import linalg

# Import the function
from scipy.linalg import inv

# Import the function with a specific name
from scipy.linalg import inv as my_inv
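
The same import styles can be tried out with the standard library, so that nothing has to be installed. The sketch below uses `math` instead of `scipy`; the aliases `m` and `to_radians` are illustrative names chosen for this example.

```python
# Import the whole module
import math

# Import the whole module under a shorter name
import math as m

# Import a single name from the module
from math import pi

# Import a single function under a different name
from math import radians as to_radians

# All of these refer to the same constant
print(math.pi == m.pi == pi)   # True

# The renamed function behaves exactly like math.radians
half_turn = to_radians(180)
print(math.isclose(half_turn, pi))   # True
```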

# NumPy

NumPy, or Numerical Python, provides an alternative to the Python list: the NumPy array. With NumPy arrays, we can perform calculations over entire arrays easily and quickly. Below, we'll create a NumPy array from a list.

In [40]:
# Import numpy
import numpy as np

# Create a list that represents the height of some baseball players in centimeters
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Create a numpy array from baseball list
np_baseball = np.array(baseball)

# Print out the type of np_baseball
print(type(np_baseball))

<class 'numpy.ndarray'>


## Baseball players' height

Suppose we decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which we'll load as a DataFrame, converting the columns we need to regular Python lists. Height is expressed in inches. In the code cell below, we'll make a `numpy` array out of the heights and convert the unit to meters.

In [42]:
# In these first lines, we'll import the dataset as a DataFrame and extract the column Height as a list
import pandas as pd
dataframe = pd.read_csv('baseball.csv')
height = dataframe['Height'].tolist()

# Import numpy
import numpy as np

# Create a numpy array from the list
np_height = np.array(height)

# Convert values to meters
np_height_meters = np_height * 0.0254

# Print both np_height and np_height_meters
print(np_height)
print(np_height_meters)

[74 74 72 ..., 75 75 73]
[ 1.8796  1.8796  1.8288 ...,  1.905   1.905   1.8542]


## Baseball players' BMI

The MLB also offered to let us analyze their weight data. Again, we'll convert a column of the DataFrame into a list in order to make our calculations. Weight is in pounds. Below, we'll convert it to kilograms and calculate the players' BMI.

In [44]:
# Extract the column Weight as a list
weight = dataframe['Weight'].tolist()

# Create a numpy array from the list above
np_weight = np.array(weight)

# Convert values to kilograms
np_weight_kilograms = np_weight * 0.453592

# Calculate players' BMI
bmi = np_weight_kilograms / (np_height_meters ** 2)

# Print players' BMI
print(bmi)

[ 23.11037639  27.60406069  28.48080465 ...,  25.62295933  23.74810865
  25.72686361]


## Lightweight baseball players

To subset both regular Python lists and numpy arrays, we can use square brackets, as follows:

In [45]:
x = [1, 2, 3, 4]
print(x[1])

y = np.array(x)
print(y[1])

2
2


For numpy specifically, we can also use boolean numpy arrays:

In [46]:
print(y[y > 2])

[3 4]


Based on that, we'll now print out a numpy array with the BMIs of all baseball players whose BMI is below 21:

In [47]:
# Create an array to store lightweight baseball players
light = bmi < 21

# Print light (boolean values)
print(light)

# Print BMIs of all baseball players whose BMI is below 21
print(bmi[light])

[False False False ..., False False False]
[ 20.54255679  20.54255679  20.69282047  20.69282047  20.34343189
  20.34343189  20.69282047  20.15883472  19.4984471   20.69282047
  20.9205219 ]
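
The same steps can be tried on a handful of made-up values, which makes it easier to see what the boolean array does. The array `bmi_sample` and its values are invented for illustration and are not taken from the MLB data.

```python
import numpy as np

# Made-up BMI values, for illustration only
bmi_sample = np.array([21.9, 20.5, 24.7, 20.9, 23.1])

# Comparing an array with a number gives one boolean per element
light_sample = bmi_sample < 21
print(light_sample)

# Using the boolean array as a subscript keeps the elements marked True
print(bmi_sample[light_sample])

# True counts as 1, so summing the boolean array counts the matches
print(light_sample.sum())
```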


## NumPy side effects

`numpy` is great for doing vector arithmetic. If we compare its functionality with regular Python lists, however, some things change.

First of all, `numpy` arrays cannot contain elements with different types. If we try to build such an array, some of the elements' types are changed to end up with a homogeneous array. This is known as *type coercion*.

Second, the typical arithmetic operators, such as `+`, `-`, `*` and `/` have a different meaning for regular Python lists and `numpy` arrays. Take as an example:

In [48]:
np.array([True, 1, 2]) + np.array([3, 4, False])

array([4, 5, 2])
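
Both effects can be seen side by side in a small sketch. The names `mixed`, `coerced`, `py_sum` and `np_sum` are illustrative.

```python
import numpy as np

# Type coercion: mixing a float, a string and a boolean gives an array of strings
mixed = np.array([1.0, "is", True])
print(mixed)
print(mixed.dtype.kind)   # 'U' stands for a Unicode string type

# Booleans mixed with integers are converted to integers
coerced = np.array([True, 1, 2])
print(coerced)            # [1 1 2]

# + on lists pastes them together, + on arrays adds element by element
py_sum = [1, 2, 3] + [4, 5, 6]
np_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])
print(py_sum)             # [1, 2, 3, 4, 5, 6]
print(np_sum)             # [5 7 9]
```

This also explains the result of the cell above: `True` and `False` are first converted to `1` and `0`, and the two arrays are then added element by element.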

## Subsetting NumPy arrays

Python lists and numpy arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting works exactly the same. In the code cell below, we'll explore a little more how to subset numpy arrays.

In [50]:
# Print the weight at index 50
print(np_weight[50])

# Print sub-array containing the heights from index 100 up to and including index 110
print(np_height[100:111])

200
[73 74 72 73 69 72 73 75 75 73 72]


## Our first 2D NumPy array

Before working on the actual MLB data, let's try to create a 2D numpy array from a small list of lists. In the code cell below, the main list contains 4 elements, and each of these elements is a list containing the height and the weight of one baseball player, in this order.

In [52]:
# Create list of lists
list_of_lists = [[180, 78.4], 
                 [215, 102.7], 
                 [210, 98.5], 
                 [188, 75.2]]

# Create a 2D numpy array from the list of lists
np_list_of_lists = np.array(list_of_lists)

# Print the type and the shape of np_list_of_lists (notice that shape is an attribute of the numpy array)
print(type(np_list_of_lists))
print(np_list_of_lists.shape)

<class 'numpy.ndarray'>
(4, 2)


## Baseball data in 2D form

If we have another look at the MLB data, we'll realize that it makes more sense to restructure all this information in a 2D numpy array. This array should have 1015 rows, corresponding to the 1015 baseball players we have information on, and 2 columns, height and weight.

In the code cell below, we'll create the list of lists from which we'll get the 2D numpy array and print out the shape of the 2D numpy array.

In [65]:
# Create list of lists
mlb_list_of_lists = []
for row in range(len(dataframe)):
    height = dataframe.iloc[row, 3]
    weight = dataframe.iloc[row, 4]
    mlb_list_of_lists.append([height, weight])

# Create a 2D numpy array from the list of lists
np_mlb_list_of_lists = np.array(mlb_list_of_lists)

# Print the shape of the 2D numpy array
print(np_mlb_list_of_lists.shape)

(1015, 2)


## Subsetting 2D NumPy arrays

If a 2D numpy array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Let's have a look at the code cell below, where the elements `"a"` and `"c"` are extracted from a list of lists.

In [67]:
# Create a regular list of lists
x = [["a", "b"], ["c", "d"]]
print([x[0][0], x[1][0]])

# Using numpy
np_x = np.array(x)
print(np_x[:, 0])

['a', 'c']
['a' 'c']


For regular Python lists this is a real pain. For 2D numpy arrays, however, it's pretty intuitive. The indexes before the comma refer to the rows, while those after the comma refer to the columns. The `:` is for slicing; in this example, it tells Python to include all the rows.

In the code cell below, we'll print out the 50th row of `np_mlb_list_of_lists`, make a new variable containing the entire second column of it, and print out the height (first column) of the 124th player.

In [69]:
# Print the 50th row
print(np_mlb_list_of_lists[49,:])

# Create new variable containing the entire second column of np_mlb_list_of_lists and print it
new_variable = np_mlb_list_of_lists[:,1]
print(new_variable)

# Print the height of the 124th player
print(np_mlb_list_of_lists[123,0])

[ 70 195]
[180 215 210 ..., 205 190 195]
75
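
The row and column subscripts are easier to follow on an array small enough to read in full. The sketch below uses three made-up players; `players` and its values are invented for illustration.

```python
import numpy as np

# Made-up data: each row is a player, columns are height and weight
players = np.array([[74, 180],
                    [72, 215],
                    [70, 195]])

second_row = players[1, :]       # all columns of the second row
heights = players[:, 0]          # first column of every row
one_value = players[2, 1]        # third row, second column
first_weights = players[:2, 1]   # second column of the first two rows

print(second_row)      # [ 72 215]
print(heights)         # [74 72 70]
print(one_value)       # 195
print(first_weights)   # [180 215]
```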


## 2D arithmetic

Remember how we calculated the Body Mass Index for all baseball players? `numpy` was able to perform all calculations element-wise (i.e. element by element). For 2D `numpy` arrays this isn't any different. We can combine matrices with single numbers, with vectors, and with other matrices. Take the code cell below as an example:

In [71]:
# Create a 2D numpy array
matrice = np.array([[1, 2], 
                    [3, 4], 
                    [5, 6]])

# Operations
print(matrice * 2)
print(matrice + np.array([10, 10]))
print(matrice + matrice)

[[ 2  4]
 [ 6  8]
 [10 12]]
[[11 12]
 [13 14]
 [15 16]]
[[ 2  4]
 [ 6  8]
 [10 12]]


In the following code cell, we'll get a new 2D numpy array from the dataset provided by the MLB containing players' height, weight and age. After that, we'll convert units using a numpy array named `conversion`.

In [90]:
# Create new 2D numpy array
new_list_of_lists = []
for row in range(len(dataframe)):
    height = dataframe.iloc[row, 3]
    weight = dataframe.iloc[row, 4]
    age = dataframe.iloc[row, 5]
    new_list_of_lists.append([height, weight, age])
new_2d_array = np.array(new_list_of_lists)

# Create an array with conversion values
conversion = np.array([0.0254, 0.453592, 1])

# Convert values and print it
converted_values = new_2d_array * conversion
print(converted_values)

[[  1.8796   81.64656  22.99   ]
 [  1.8796   97.52228  34.69   ]
 [  1.8288   95.25432  30.78   ]
 ..., 
 [  1.905    92.98636  25.19   ]
 [  1.905    86.18248  31.01   ]
 [  1.8542   88.45044  27.92   ]]


## Average versus median

As we can see below, the `median` is a more robust summary statistic than the mean when we're dealing with so-called *outliers*.

In [93]:
# Build a copy of the data with artificial outliers:
# every 50th row (including row 0) is multiplied by 1000
another_list_of_lists = []
for row in range(len(dataframe)):
    scale = 1000 if row % 50 == 0 else 1
    height = dataframe.iloc[row, 3] * scale
    weight = dataframe.iloc[row, 4] * scale
    age = dataframe.iloc[row, 5] * scale
    another_list_of_lists.append([height, weight, age])
another_2d_array = np.array(another_list_of_lists)

# Print the mean of heights in another_2d_array
print(np.mean(another_2d_array[:,0]))

# Print the median of heights in another_2d_array
print(np.median(another_2d_array[:,0]))

1586.46108374
74.0
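
The same effect shows up on a list small enough to check by hand. The sketch below uses the standard library `statistics` module instead of `numpy`, and the height values are made up for illustration.

```python
from statistics import mean, median

# Made-up heights in inches
heights_clean = [72, 73, 74, 75, 76]

# The same list with the last value multiplied by 1000
heights_outlier = [72, 73, 74, 75, 76000]

print(mean(heights_clean), median(heights_clean))       # 74 74
print(mean(heights_outlier), median(heights_outlier))   # 15258.8 74
```

A single extreme value moves the mean from 74 to 15258.8, while the median stays at 74 because it only depends on the middle value of the sorted data.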


## Exploring the baseball data

In the code cell below, we'll explore the data provided by MLB.

In [95]:
# Print mean of height
average = np.mean(new_2d_array[:,0])
print("Average: " + str(average))

# Print median of height
median = np.median(new_2d_array[:, 0])
print("Median: " + str(median))

# Print out the standard deviation of height
standard_deviation = np.std(new_2d_array[:, 0])
print("Standard Deviation: " + str(standard_deviation))

# Print out correlation between first and second column
correlation = np.corrcoef(new_2d_array[:, 0], new_2d_array[:, 1])
print("Correlation: " + str(correlation))

Average: 73.6896551724
Median: 74.0
Standard Deviation: 2.31279188105
Correlation: [[ 1.          0.53153932]
 [ 0.53153932  1.        ]]
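
`np.corrcoef()` returns a 2 x 2 matrix rather than a single number: the diagonal holds the correlation of each variable with itself (always 1), and the off-diagonal entries hold the correlation between the two variables. The sketch below uses made-up arrays `a`, `b` and `c` to show how to read it.

```python
import numpy as np

# Made-up data: b grows with a, c shrinks as a grows
a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1
c = -a

corr_ab = np.corrcoef(a, b)
print(corr_ab.shape)   # (2, 2)

# The off-diagonal element is the correlation between the two inputs
r_ab = corr_ab[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]
print(r_ab)   # close to 1: perfect positive linear relationship
print(r_ac)   # close to -1: perfect negative linear relationship
```

For the baseball data above, the value of interest is therefore `correlation[0, 1]`, which is about 0.53.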


## Blending it all together

Suppose we've contacted FIFA for some data and they handed us two lists: `positions = ['GK', 'M', 'A', 'D', ...]` and `heights = [191, 184, 185, 180, ...]`. Each element in the lists corresponds to a player.

In the code cell below, we'll convert the lists to numpy arrays, extract the heights of the goalkeepers, extract the heights of all other players, print out the median height of the goalkeepers and do the same for the other players.

In [100]:
# In these first lines, we'll import the dataset as a DataFrame and extract the columns as lists
fifa_dataframe = pd.read_csv('fifa.csv', skipinitialspace = True, usecols = ['position', 'height'])
fifa_positions = list(fifa_dataframe.position)
fifa_heights = list(fifa_dataframe.height)

# Convert lists to numpy arrays
np_fifa_positions = np.array(fifa_positions)
np_fifa_heights = np.array(fifa_heights)

# Extract heights of the goalkeepers
gk_heights = np_fifa_heights[np_fifa_positions == 'GK']

# Extract heights of the other players
other_players_heights = np_fifa_heights[np_fifa_positions != 'GK']

# Print goalkeepers median height and other players median height
print("Median height of goalkeepers: " + str(np.median(gk_heights)))
print("Median height of other players: " + str(np.median(other_players_heights)))

Median height of goalkeepers: 188.0
Median height of other players: 181.0
