# NumPy

In [40]:
import pandas as pd

baseball = pd.read_csv('datasets/MLB(baseball).csv')
print(baseball.head())

              Name Team       Position  Height  Weight    Age PosCategory
0    Adam_Donachie  BAL        Catcher      74     180  22.99     Catcher
1        Paul_Bako  BAL        Catcher      74     215  34.69     Catcher
2  Ramon_Hernandez  BAL        Catcher      72     210  30.78     Catcher
3     Kevin_Millar  BAL  First_Baseman      72     210  35.43   Infielder
4      Chris_Gomez  BAL  First_Baseman      73     188  35.71   Infielder


### Instructions

- Import the numpy package as `np`, so that you can refer to `numpy` with `np`.
- Use `np.array()` to create a `numpy` array from `baseball`. Name this array `np_baseball`.
- Print out the type of `np_baseball` to check that you got it right.

In [16]:
# Import the numpy package as np
import numpy as np

# Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out type of np_baseball
print(type(np_baseball))

print(np_baseball)

<class 'numpy.ndarray'>
[['Adam_Donachie' 'BAL' 'Catcher' ... 180 22.99 'Catcher']
 ['Paul_Bako' 'BAL' 'Catcher' ... 215 34.69 'Catcher']
 ['Ramon_Hernandez' 'BAL' 'Catcher' ... 210 30.78 'Catcher']
 ...
 ['Chris_Narveson' 'STL' 'Relief_Pitcher' ... 205 25.19 'Pitcher']
 ['Randy_Keisler' 'STL' 'Relief_Pitcher' ... 190 31.01 'Pitcher']
 ['Josh_Kinney' 'STL' 'Relief_Pitcher' ... 195 27.92 'Pitcher']]
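
Because `baseball` mixes text and numeric columns, `np.array()` stores every value as a generic Python object rather than as a number. The sketch below illustrates this with a hypothetical three-row stand-in named `rows` (not the real data) and shows one way to get a numeric array back out of a single column.

```python
import numpy as np

# Hypothetical stand-in for np_baseball: names mixed with numbers
rows = np.array([["Adam_Donachie", 74, 180],
                 ["Paul_Bako", 74, 215],
                 ["Ramon_Hernandez", 72, 210]], dtype=object)

print(rows.dtype)    # object

# Select the numeric column and cast it to a proper float array
heights = rows[:, 1].astype(float)
print(heights.dtype)  # float64
```

Casting with `astype(float)` is optional for simple arithmetic, but some `numpy` functions expect a numeric dtype.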


## Baseball players' height
You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: `height_in`. The height is expressed in inches. Can you make a `numpy` array out of it and convert the units to meters?

`height_in` is already available and the numpy package is loaded, so you can start straight away.

### Instructions

- Create a `numpy` array from `height_in`. Name this new array `np_height_in`.
- Print `np_height_in`.
- Multiply `np_height_in` with `0.0254` to convert all height measurements from inches to meters. Store the new values in a new array, `np_height_m`.
- Print out `np_height_m` and check if the output makes sense.

In [3]:
# Assuming 'Height' is at index 3 in your array
np_height_in = np_baseball[:, 3]
print(np_height_in)

# Convert np_height_in to m: np_height_m
np_height_m = np_height_in * 0.0254

# Print np_height_m
print(np_height_m)

[74 74 72 ... 75 75 73]
[1.8796 1.8796 1.8288 ... 1.905 1.905 1.8541999999999998]
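
The unit conversion relies on `numpy` applying the multiplication to every element. A regular list behaves differently, as this small sketch shows; the three-value `height_in` here is a hypothetical sample, not the full dataset.

```python
import numpy as np

# Hypothetical sample; the real height_in has more than a thousand entries
height_in = [74, 74, 72]

# Multiplying a list by an integer repeats the list
doubled = height_in * 2
print(doubled)

# Multiplying a list by a float is not defined at all
try:
    height_in * 0.0254
except TypeError as err:
    print("TypeError:", err)

# A numpy array applies the multiplication to each element
np_height_m = np.array(height_in) * 0.0254
print(np_height_m)
```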


## Baseball player's BMI
The MLB also offers to let you analyze their weight data. Both datasets are available as regular Python lists: `height_in` and `weight_lb`. `height_in` is in inches and `weight_lb` is in pounds.

It's now possible to calculate the BMI of each baseball player. Python code to convert `height_in` to a `numpy` array with the correct units is already available in the workspace. Follow the instructions step by step and finish the game! `height_in` and `weight_lb` are available as regular lists.

### Instructions

- Create a `numpy` array from the `weight_lb` list with the correct units. Multiply by `0.453592` to go from pounds to kilograms. Store the resulting numpy array as `np_weight_kg`.
- Use `np_height_m` and `np_weight_kg` to calculate the BMI of each player. Use the following equation: `bmi = np_weight_kg / np_height_m ** 2`
- Save the resulting `numpy` array as `bmi`.
- Print out `bmi`.

In [4]:
# Create array from weight_lb with metric units: np_weight_kg
weight_lb = np_baseball[:, 4]
np_weight_kg = np.array(weight_lb) * 0.453592

# Calculate the BMI: bmi
bmi = np_weight_kg / np_height_m ** 2

# Print out bmi
print(bmi)

[23.11037638875862 27.604060686572797 28.48080464679448 ...
 25.62295933480756 23.74810865177286 25.726863613607133]
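
As a worked check of the BMI calculation, the sketch below repeats it for the first three players shown in the table at the top of this notebook. It builds the arrays directly instead of slicing `np_baseball`, which is an assumption made to keep the example self-contained.

```python
import numpy as np

# First three players from the table (height in inches, weight in pounds)
height_in = np.array([74, 74, 72])
weight_lb = np.array([180, 215, 210])

np_height_m = height_in * 0.0254
np_weight_kg = weight_lb * 0.453592

# BMI = weight (kg) / height (m) squared, computed element by element
bmi = np_weight_kg / np_height_m ** 2
print(bmi.round(2))
```

The results agree with the first three values in the output above.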


## Lightweight baseball players
To subset both regular Python lists and `numpy` arrays, you can use square brackets:

    x = [4, 9, 6, 3, 1]
    x[1]

    y = np.array(x)
    y[1]

For `numpy` specifically, you can also use boolean `numpy` arrays:

    high = y > 5
    y[high]

The code that calculates the BMI of all baseball players is already included. Follow the instructions and reveal interesting things from the data! `height_in` and `weight_lb` are available as regular lists.

### Instructions

- Create a boolean `numpy` array: the element of the array should be `True` if the corresponding baseball player's BMI is below 21. You can use the `<` operator for this. Name the array `light`.
- Print the array `light`.
- Print out a `numpy` array with the BMIs of all baseball players whose BMI is below 21. Use `light` inside square brackets to do a selection on the `bmi` array.

In [7]:
# Create the light array
light = bmi < 21

# Print out light
print(light)

# Print out BMIs of all baseball players whose BMI is below 21
bmi[light]

[False False False ... False False False]
[20.542556790007662 20.542556790007662 20.69282047151352 20.69282047151352
 20.343431890567484 20.343431890567484 20.69282047151352
 20.158834718074228 19.498447103560874 20.69282047151352 20.92052190452328]


array([20.542556790007662, 20.542556790007662, 20.69282047151352,
       20.69282047151352, 20.343431890567484, 20.343431890567484,
       20.69282047151352, 20.158834718074228, 19.498447103560874,
       20.69282047151352, 20.92052190452328], dtype=object)
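
Boolean arrays can also be counted and combined. The following sketch uses hypothetical BMI values to show both, in addition to the selection used above.

```python
import numpy as np

# Hypothetical BMI values for five players
bmi = np.array([23.11, 27.60, 20.54, 19.50, 25.62])

light = bmi < 21
print(light)

# True counts as 1, so the sum is the number of light players
n_light = light.sum()
print(n_light)        # 2

# Combine conditions with & (and) or | (or); each condition needs parentheses
between = (bmi > 20) & (bmi < 21)
print(bmi[between])
```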

## Subsetting NumPy Arrays

You've seen it with your own eyes: Python lists and numpy arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same. To see this for yourself, try the following lines of code in the IPython Shell:

    x = ["a", "b", "c"]
    x[1]

    np_x = np.array(x)
    np_x[1]

### Instructions

- Subset `np_weight_lb` by printing out the element at index 50.
- Print out a sub-array of `np_height_in` that contains the elements at index 100 up to and **including** index 110.

In [8]:
# Store weight and height lists as numpy arrays
np_weight_lb = np_baseball[:, 4]
np_height_in = np_baseball[:, 3]

# Print out the weight at index 50
print(np_weight_lb[50])

# Print out sub-array of np_height_in: index 100 up to and including index 110
print(np_height_in[100:111])

200
[73 74 72 73 69 72 73 75 75 73 72]
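
The slice `100:111` is needed because the end index of a slice is excluded. A small sketch with a hypothetical array whose values equal their own indexes makes this easy to see:

```python
import numpy as np

# Hypothetical array: the value at each position equals its index
values = np.arange(200)

single = values[50]       # one element
sub = values[100:111]     # indexes 100, 101, ..., 110 (111 is excluded)

print(single)             # 50
print(sub)
print(len(sub))           # 11
```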


## Your First 2D NumPy Array
Before working on the actual MLB data, let's try to create a 2D `numpy` array from a small list of lists.

In this exercise, `baseball_ex` is a list of lists. The main list contains 4 elements. Each of these elements is a list containing the height and the weight of one of 4 baseball players, in this order. `baseball_ex` is already coded for you in the script.

### Instructions

- Use `np.array()` to create a 2D numpy array from `baseball_ex`. Name it `np_baseball_ex`.
- Print out the type of `np_baseball_ex`.
- Print out the shape attribute of `np_baseball_ex`. Use `np_baseball_ex.shape`.

In [13]:
# Create baseball_ex, a list of lists
baseball_ex = [[180, 78.4],
               [215, 102.7],
               [210, 98.5],
               [188, 75.2]]

# Create a 2D numpy array from baseball_ex: np_baseball_ex
np_baseball_ex = np.array(baseball_ex)

# Print out the type of np_baseball_ex
print(type(np_baseball_ex))

# Print out the shape of np_baseball_ex
print(np_baseball_ex.shape)

<class 'numpy.ndarray'>
(4, 2)
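
Besides `shape`, a few other attributes describe a 2D array. This sketch reuses the `baseball_ex` list from the cell above and prints them:

```python
import numpy as np

baseball_ex = [[180, 78.4],
               [215, 102.7],
               [210, 98.5],
               [188, 75.2]]
np_baseball_ex = np.array(baseball_ex)

print(np_baseball_ex.ndim)    # 2: rows and columns
print(np_baseball_ex.shape)   # (4, 2): 4 rows, 2 columns
print(np_baseball_ex.size)    # 8 values in total
print(np_baseball_ex.dtype)   # float64: the integers were converted to floats
```

Because a `numpy` array holds a single type, the integer values are stored as floats alongside the decimal ones.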


## Baseball data in 2D form
You have another look at the MLB data and realize that it makes more sense to restructure all this information in a 2D `numpy` array. This array should have 1015 rows, corresponding to the 1015 baseball players you have information on, and 2 columns (for height and weight).

The MLB was, again, very helpful and passed you the data in a different structure, a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The name of this embedded list is `baseball`.

Can you store the data as a 2D array to unlock `numpy`'s extra functionality? `baseball` is available as a regular list of lists.

### Instructions

- Use `np.array()` to create a 2D `numpy` array from `baseball`. Name it `np_baseball`.
- Print out the shape attribute of `np_baseball`.



In [17]:
# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the shape of np_baseball
print(np_baseball.shape)

(1015, 7)


## Subsetting 2D NumPy Arrays
If your 2D `numpy` array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements `"a"` and `"c"` are extracted from a list of lists.

    # regular list of lists
    x = [["a", "b"], ["c", "d"]]
    [x[0][0], x[1][0]]
    
    # numpy
    import numpy as np
    np_x = np.array(x)
    np_x[:, 0]

For regular Python lists, this is a real pain. For 2D numpy arrays, however, it's pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. The `:` is for slicing; in this example, it tells Python to include all rows.

The code that converts the pre-loaded `baseball` list to a 2D numpy array is already in the script. The first column contains the players' height in inches and the second column holds player weight, in pounds. Add some lines to make the correct selections. Remember that in Python, the first element is at index 0! `baseball` is available as a regular list of lists.

### Instructions

- Create `np_baseball_he_we` containing height and weight columns of `baseball`
- Print out the 50th row of `np_baseball_he_we`.
- Make a new variable, `np_weight_lb`, containing the entire second column of `np_baseball`.
- Select the height (first column) of the 124th baseball player in `np_baseball` and print it out.



In [22]:
# Create np_baseball_he_we containing 'Height' and 'Weight' columns
np_baseball_he_we = np_baseball[:, [3, 4]]

# Print out the 50th row of np_baseball_he_we
print(np_baseball_he_we[49, :])

# Select the entire second column of np_baseball_he_we: np_weight_lb
np_weight_lb = np_baseball_he_we[:, 1]

# Print out height (first column) of the 124th player
print(np_baseball_he_we[123, 0])

[70 195]
75
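
The row-comma-column pattern covers several kinds of selection. The sketch below walks through them on a hypothetical three-player array named `np_hw`:

```python
import numpy as np

# Hypothetical height/weight pairs for three players
np_hw = np.array([[74, 180],
                  [74, 215],
                  [72, 210]])

first_row = np_hw[0, :]       # all columns of row 0
weights = np_hw[:, 1]         # all rows of column 1
one_value = np_hw[2, 0]       # row 2, column 0
swapped = np_hw[:, [1, 0]]    # a list of column indexes selects (and reorders) columns

print(first_row)
print(weights)
print(one_value)
print(swapped)
```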


## 2D Arithmetic
Remember how you calculated the Body Mass Index for all baseball players? `numpy` was able to perform all calculations element-wise (i.e. element by element). For 2D `numpy` arrays this isn't any different! You can combine matrices with single numbers, with vectors, and with other matrices.

Execute the code below in the IPython shell and see if you understand:

    import numpy as np
    np_mat = np.array([[1, 2],
                       [3, 4],
                       [5, 6]])
    np_mat * 2
    np_mat + np.array([10, 10])
    np_mat + np_mat

`np_baseball_he_we_ye` is coded for you; it's again a 2D `numpy` array with 3 columns representing height (in inches), weight (in pounds) and age (in years).

### Instructions

- You managed to get hold of the changes in height, weight and age of all baseball players. It is available as a 2D `numpy` array, `updated`. Add `np_baseball_he_we_ye` and `updated` and print out the result.
- You want to convert the units of height and weight to metric (meters and kilograms, respectively). As a first step, create a numpy array with three values: `0.0254`, `0.453592` and `1`. Name this array `conversion`.
- Multiply `np_baseball_he_we_ye` with `conversion` and print out the result.

In [27]:
# Create np_baseball (3 cols)
np_baseball_he_we_ye = np_baseball[:, [3, 4, 5]]

# Create numpy array: conversion
conversion = np.array([0.0254, 0.453592 , 1])

# Print out product of np_baseball_he_we_ye and conversion
np_baseball_he_we_ye[:, :2] *= conversion[:2]  # Apply conversion to Height and Weight columns
print(np_baseball_he_we_ye)

[[1.8796 81.64656 22.99]
 [1.8796 97.52228 34.69]
 [1.8288 95.25431999999999 30.78]
 ...
 [1.905 92.98636 25.19]
 [1.905 86.18248 31.01]
 [1.8541999999999998 88.45044 27.92]]
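
Multiplying a 2D array by a 1D array works because `numpy` stretches the shorter array across every row (broadcasting). The sketch below runs the `np_mat` examples from this section and adds one hypothetical per-column scaling to mirror the `conversion` step:

```python
import numpy as np

np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

times_two = np_mat * 2                    # every element doubled
plus_vec = np_mat + np.array([10, 10])    # the vector is added to each row
plus_self = np_mat + np_mat               # element-by-element sum

# Hypothetical per-column factors: scale column 0 by 10, leave column 1 alone
scaled = np_mat * np.array([10, 1])

print(times_two)
print(plus_vec)
print(plus_self)
print(scaled)
```

The 1D array must have as many values as the 2D array has columns, otherwise `numpy` raises an error.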


## Average versus median
You now know how to use `numpy` functions to get a better feeling for your data. It basically comes down to importing `numpy` and then calling several simple functions on the numpy arrays:
    
    import numpy as np
    x = [1, 4, 8, 10, 12]
    np.mean(x)
    np.median(x)

The baseball data is available as a 2D numpy array with 3 columns (height, weight, age) and 1015 rows. The name of this numpy array is `np_baseball_he_we_ye`. After restructuring the data, however, you notice that some height values are abnormally high. Follow the instructions and discover which summary statistic is best suited if you're dealing with so-called outliers. `np_baseball` is available.

### Instructions

- Create numpy array `np_baseball_he_we_ye` that is equal to first column of `np_baseball`.
- Print out the `mean` of `np_baseball_he_we_ye`.
- Print out the `median` of `np_baseball_he_we_ye`.

In [29]:
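# Create np_baseball_he_we_ye from np_baseball (height, weight and age columns)
np_baseball_he_we_ye = np_baseball[:, [3, 4, 5]]

# Print out the mean of np_baseball_he_we_ye
# (no axis is given, so the values of all three columns are pooled together)
print(np.mean(np_baseball_he_we_ye))

# Print out the median of np_baseball_he_we_ye (also pooled over all columns)
print(np.median(np_baseball_he_we_ye))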

101.24892610837442
74.0
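
When `np.mean()` receives a 2D array without an `axis` argument, it averages every value in the array. The sketch below, built on a hypothetical three-player array `np_hw`, contrasts that with per-column statistics, and then shows how a single outlier affects the mean and the median of the example list `x`.

```python
import numpy as np

# Hypothetical height/weight pairs for three players
np_hw = np.array([[74, 180],
                  [74, 215],
                  [72, 210]])

overall = np.mean(np_hw)               # every value in one pool: 137.5
per_column = np.mean(np_hw, axis=0)    # one mean per column
height_only = np.mean(np_hw[:, 0])     # mean of the first column only
print(overall, per_column, height_only)

# A single extreme value moves the mean much more than the median
x = [1, 4, 8, 10, 12]
with_outlier = x + [1000]
print(np.mean(x), np.median(x))                          # 7.0 8.0
print(np.mean(with_outlier), np.median(with_outlier))    # 172.5 9.0
```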


## Explore the baseball data
Because the `mean` and `median` are so far apart, you decide to complain to the MLB. They find the error and send the corrected data over to you. It's again available as a 2D NumPy array `np_baseball_he_we_ye`, with three columns.

The Python script in the editor already includes code to print out informative messages with the different summary statistics. Can you finish the job? `np_baseball` is available.

### Instructions

- The code to print out the mean height is already included. Complete the code for the median height. Replace `None` with the correct code.
- Use `np.std()` on the first column of `np_baseball` to calculate `stddev`. Replace `None` with the correct code.
- Do big players tend to be heavier? Use `np.corrcoef()` to store the correlation between the first and second column of `np_baseball` in `corr`. Replace `None` with the correct code.

In [38]:
# Print mean height (first column)
avg = np.mean(np_baseball_he_we_ye[:,0])
print("Average: " + str(avg))

# Print median height. Replace 'None'
med = np.median(np_baseball_he_we_ye[:,0])
print("Median: " + str(med))

# Print out the standard deviation on height. Replace 'None'
stddev = np.std(np_baseball_he_we_ye[:,0])
print("Standard Deviation: " + str(stddev))

# Print out correlation between first and second column. Replace 'None'
# np_baseball_he_we_ye comes from a mixed-type array, so cast height and weight to float first
heights = np_baseball_he_we_ye[:, 0].astype(float)
weights = np_baseball_he_we_ye[:, 1].astype(float)

# Calculate correlation between height and weight
corr = np.corrcoef(heights, weights)
print("Correlation:", corr)

Average: 73.6896551724138
Median: 74.0
Standard Deviation: 2.3127918810465395
Correlation: [[1.         0.53153932]
 [0.53153932 1.        ]]
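
`np.corrcoef()` returns a 2x2 matrix: the diagonal holds each variable's correlation with itself (always 1) and the off-diagonal entries hold the height-weight correlation. The sketch below pulls out that single number from a hypothetical five-player sample, and also shows that `np.std()` computes the population standard deviation unless `ddof=1` is passed.

```python
import numpy as np

# Hypothetical sample of five players
heights = np.array([74.0, 74.0, 72.0, 72.0, 73.0])
weights = np.array([180.0, 215.0, 210.0, 210.0, 188.0])

corr = np.corrcoef(heights, weights)
print(corr.shape)     # (2, 2)

# Height-weight correlation; corr[1, 0] holds the same number
r = corr[0, 1]
print(r)

# np.std divides by n by default (ddof=0); ddof=1 gives the sample standard deviation
population_std = np.std(heights)
sample_std = np.std(heights, ddof=1)
print(population_std, sample_std)
```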


## Blend it all together
In the last few exercises you've learned everything there is to know about heights and weights of baseball players. Now it's time to dive into another sport: football.

You've contacted FIFA for some data, and they handed you two lists. The lists are the following:

    positions = ['GK', 'M', 'A', 'D', ...]
    heights = [191, 184, 185, 180, ...]

Each element in the lists corresponds to a player. The first list, `positions`, contains strings representing each player's position. The possible positions are: `'GK'` (goalkeeper), `'M'` (midfield), `'A'` (attack) and `'D'` (defense). The second list, `heights`, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (191 cm).

You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills. `heights` and `positions` are available as lists.

### Instructions

- Convert `heights` and `positions`, which are regular lists, to numpy arrays. Call them `np_heights` and `np_positions`.
- Extract all the heights of the goalkeepers. You can use a little trick here: use `np_positions == 'GK'` as an index for `np_heights`. Assign the result to `gk_heights`.
- Extract all the heights of all the other players. This time use `np_positions != 'GK'` as an index for `np_heights`. Assign the result to `other_heights`.
- Print out the median height of the goalkeepers using `np.median()`. Replace `None` with the correct code.
- Do the same for the other players. Print out their median height. Replace `None` with the correct code.

In [3]:
# Read the CSV file with a different encoding
football = pd.read_csv('datasets/FIFA(Football).csv', encoding='ISO-8859-1')

# Remove leading and trailing spaces from all string values in the football DataFrame
football = football.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# Create a numpy array from football: np_football
np_football = np.array(football)

# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np_football[:, 3]
np_heights = np_football[:, 4]

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == 'GK']

# Heights of the other players: other_heights
other_heights = np_heights[np_positions != 'GK']

# Print out the median height of goalkeepers. Replace 'None'
print("Median height of goalkeepers: " + str(np.median(gk_heights)))

# Print out the median height of other players. Replace 'None'
print("Median height of other players: " + str(np.median(other_heights)))

Median height of goalkeepers: 188.0
Median height of other players: 181.0
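
The key step is using a boolean mask built from one array to index another array of the same length. A minimal sketch, using only the first four players from the example lists above (the real data contains far more):

```python
import numpy as np

# The first four players from the example lists
positions = ['GK', 'M', 'A', 'D']
heights = [191, 184, 185, 180]

np_positions = np.array(positions)
np_heights = np.array(heights)

is_gk = np_positions == 'GK'          # boolean mask built from one array...
gk_heights = np_heights[is_gk]        # ...used to index the other
other_heights = np_heights[~is_gk]    # ~ flips the mask, same as np_positions != 'GK'

print(np.median(gk_heights))       # 191.0
print(np.median(other_heights))    # 184.0
```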
