# Python Libraries
---

The core Python language is, by design, somewhat minimal. Like other programming languages, Python has an ecosystem of modules (libraries of code) that add functionality to the base language.

A library can be thought of as a collection of functions and data types that can be accessed to complete certain tasks.

An overview of some important libraries:

* Numpy is a library for working with arrays of data.
* Pandas provides high-performance, easy-to-use data structures and data analysis tools.
* Scipy is a library of techniques for numerical and scientific computing.
* Matplotlib is a library for making graphs.
* Seaborn is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.
* Statsmodels is a library that implements statistical techniques.

This notebook introduces the Pandas and Numpy libraries, which are used to manipulate datasets.

## Importing libraries

Python scripts generally begin by importing the libraries that will be used.

The following statements import the Numpy and Pandas libraries, giving them abbreviated names:

In [None]:
import numpy as np
import pandas as pd

## Utilizing library functions

After importing a library, its functions can then be called from your code by prepending the library name to the function name.

For example, to use the `array` function from the `numpy` library after a plain `import numpy`, write `numpy.array`.

To avoid repeatedly typing the full library name in scripts, it is conventional to define a two- or three-letter abbreviation for each library.

For example, `numpy` is usually abbreviated as `np`. This allows using `np.array` instead of `numpy.array`. Similarly, the Pandas library is typically abbreviated as `pd`.

In [None]:
arr1 = np.array([x for x in range(11)])
print("Mean of {} is {}.".format(arr1, np.mean(arr1)))

Mean of [ 0  1  2  3  4  5  6  7  8  9 10] is 5.0.


In the code block above, the `array` function from the numpy library first creates a 1-dimensional array, and the `mean` function from the same library then calculates its average value.

## NumPy
---

NumPy is a fundamental package for scientific computing with Python. It includes data types for vectors, matrices, and higher-order arrays (tensors), and many commonly-used mathematical functions such as logarithms.


### Numpy Arrays

The `ndarray` object is an n-dimensional array of values that comes with methods for manipulating those values.

A Python list may contain values of different types. For example, `[1, "pig", [3.2, 4.5]]` is a Python list containing three elements - an integer, a string, and another list that itself contains two floating point values.

Lists containing inhomogeneous types are convenient, but do not perform well for large-scale numerical computing. The numpy ndarray is a homogeneous array that may have any number of axes. Since it is homogeneous, all values in one ndarray must have the same data type (for example, all integers or all floating point numbers).
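
The contrast between a list and a homogeneous array can be sketched as follows. The variable names are illustrative only, and the exact string dtype NumPy picks may vary by platform, so the example inspects only the dtype's kind.

```python
import numpy as np

# A Python list can hold values of different types side by side.
mixed_list = [1, 2.5, 3]
print([type(v).__name__ for v in mixed_list])

# np.array converts the same values to one common type (here, floating point).
mixed_arr = np.array(mixed_list)
print(mixed_arr, mixed_arr.dtype)

# Mixing numbers with text yields a string array: the numbers become text.
text_arr = np.array([1, "pig"])
print(text_arr, text_arr.dtype.kind)  # kind "U" means Unicode string
```

In both cases the array stays homogeneous: NumPy chooses a single type that can represent every value.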

A numpy array is a table of values that may have any number of "axes". A 1-dimensional numpy array has a single axis, and is somewhat analogous to a Python list or a mathematical vector. A 2-dimensional numpy array has two axes, and can be seen as a table or matrix.

Higher-order arrays (tensors) can be useful in specific cases, but are not encountered as often.

Numpy arrays are indexed by a sequence of zero-based integer positions -- that is, `x[0]` is the first element of the 1-d array `x`. The number of axes (dimensions) is the rank of the array, available as the `ndim` attribute; the shape of an array is a tuple of integers giving the size of the array along each dimension.

In [None]:
# Creating a rank-1 numpy array with 1 axis of length 3
arr2 = np.array([10, 20, 30])

# Print object type
print("type(arr2) = {}".format(type(arr2)))

# Print shape
print("arr2 shape = {}".format(arr2.shape))

# Print some values in arr2
print("\nValues in arr2: {}, {}, {}".format(arr2[0], arr2[1], arr2[2]))

# Create a 2x2 numpy array
arr3 = np.array([[10, 20], [100, 200]])

# Print arr3 shape
print("arr3 shape = {}".format(arr3.shape))

# Print some values in arr3
print("Values in arr3: {}, {}, {}".format(arr3[0,0], arr3[0,1], arr3[1,1]))

# Create a 3x2 numpy array
arr4 = np.array([[2, 4], [4, 8], [8, 16]])

# Print arr4 shape
print("\narr4 shape = {}".format(arr4.shape))

# Print some values in arr4
print("Value in first column of arr4: {}, {}, {}".format(arr4[0,0], arr4[1,0], arr4[2, 0]))

type(arr2) = <class 'numpy.ndarray'>
arr2 shape = (3,)

Values in arr2: 10, 20, 30
arr3 shape = (2, 2)
Values in arr3: 10, 20, 200

arr4 shape = (3, 2)
Value in first column of arr4: 2, 4, 8
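
The same attributes apply to arrays with more than two axes. The following is a minimal sketch using a hypothetical 3-dimensional array named `cube`, which is not part of the notebook's data.

```python
import numpy as np

# A hypothetical 3-dimensional array: 2 blocks, each with 3 rows and 4 columns.
cube = np.zeros((2, 3, 4))

print("ndim  =", cube.ndim)    # number of axes
print("shape =", cube.shape)   # size along each axis
print("size  =", cube.size)    # total number of elements

# Indexing takes one zero-based position per axis.
cube[1, 2, 3] = 5.0
print(cube[1, 2, 3])
```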


In [None]:
# Creating 2x3 array containing zeros
arr5 = np.zeros((2,3))
print("arr5 = \n{}\n".format(arr5))

# Creating 4x2 array of ones
arr6 = np.ones([4, 2])
print("arr6 = \n{}\n".format(arr6))

# Creating a 2x2 constant array
arr7 = np.full((2,2), 9)
print("arr7 = \n{}\n".format(arr7))

# Creating a 3x3 random array
arr8 = np.random.random((3, 3))
print("arr8 = \n{}\n".format(arr8))

arr5 = 
[[0. 0. 0.]
 [0. 0. 0.]]

arr6 = 
[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]

arr7 = 
[[9 9]
 [9 9]]

arr8 = 
[[0.0445088  0.71649971 0.4426652 ]
 [0.37827834 0.77680594 0.74719489]
 [0.98339943 0.61222654 0.43965865]]



### Array indexing and aliasing

NumPy arrays can share memory. In particular, slicing an array returns a view of the original data rather than a copy, so changing the values in one array may alter the values in the other.

In [None]:
# Creating a 3x4 array
arr9 = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print("arr9 = \n{}\n".format(arr9))

# Slice array to make a 2x2 sub-array
arr10 = arr9[:2, 1:3]
print('arr10 = \n{}\n'.format(arr10))

print("arr9[0, 1] = \n{}\n".format(arr9[0, 1]))

# Modifying the slice
arr10[0, 0] = 1500

print("arr10 = \n{}\n".format(arr10))
print("arr9[0, 1] = {}".format(arr9[0,1]))

arr9 = 
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

arr10 = 
[[2 3]
 [6 7]]

arr9[0, 1] = 
2

arr10 = 
[[1500    3]
 [   6    7]]

arr9[0, 1] = 1500


To ensure that two arrays do not share memory, the `copy` method can be used.

In [None]:
arr11 = np.zeros((3, 3))
arr12 = arr11[0:2, 0:2].copy()

arr11[0, 0] = 7

print("arr11 = \n{}\n".format(arr11))
print("arr12 = \n{}\n".format(arr12))

arr11 = 
[[7. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

arr12 = 
[[0. 0.]
 [0. 0.]]
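
Whether two arrays overlap in memory can also be checked directly. The sketch below uses `np.shares_memory` for this; the names `base`, `view`, and `independent` are illustrative.

```python
import numpy as np

base = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = base[:2, 1:3]                 # basic slicing: shares memory with base
independent = base[:2, 1:3].copy()   # explicit copy: separate memory

print(np.shares_memory(base, view))          # True
print(np.shares_memory(base, independent))   # False

# Writing through the view changes base; the copy is unaffected.
view[0, 0] = 1500
print(base[0, 1])          # 1500
print(independent[0, 0])   # 2
```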



### Datatypes

Numpy arrays are homogeneous, and the data type shared by all elements can be retrieved by using the `dtype` attribute.

In [None]:
# array with integer data
arr13 = np.array([1, 2])
print(arr13.dtype)

# array with float data
arr14 = np.array([1.0, 2.0])
print(arr14.dtype)

# forcing a datatype
arr15 = np.array([1.0, 2.0], dtype = np.int64)
print(arr15.dtype)

int64
float64
int64
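
An existing array can also be converted to another data type with the `astype` method, which returns a new array. The sketch below uses made-up values to show that converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

floats = np.array([1.7, 2.2, -1.7])

# astype returns a new array; the fractional part is discarded (no rounding).
ints = floats.astype(np.int64)
print(ints, ints.dtype)   # [ 1  2 -1] int64

# Round first if nearest-integer values are wanted.
rounded = np.round(floats).astype(np.int64)
print(rounded)            # [ 2  2 -2]

# The original array keeps its own dtype.
print(floats.dtype)       # float64
```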


### Array arithmetic

Basic mathematical functions operate element-wise on arrays, and are available both as operator symbols (+, -, etc.) and as functions in the numpy module.

In [None]:
x = np.array([[1, 2], [3, 4]], dtype = np.float64)
y = np.array([[5, 6], [7, 8]], dtype = np.float64)

# elementwise sum; both produce an array
print("x + y = \n{}\n".format(x + y))
print(np.add(x, y), end="\n\n")

# elementwise difference; both produce an array
print("x - y = \n{}\n".format(x - y))
print(np.subtract(x, y), end="\n\n")

# elementwise product; both produce an array
print("x * y = \n{}\n".format(x * y))
print(np.multiply(x, y), end = "\n\n")

# elementwise division; both produce an array
print("x / y = \n{}\n".format(x / y))
print(np.divide(x, y), end="\n\n")

# elementwise square root; produces an array
print("sqrt(x) = \n{}\n".format(np.sqrt(x)))

x + y = 
[[ 6.  8.]
 [10. 12.]]

[[ 6.  8.]
 [10. 12.]]

x - y = 
[[-4. -4.]
 [-4. -4.]]

[[-4. -4.]
 [-4. -4.]]

x * y = 
[[ 5. 12.]
 [21. 32.]]

[[ 5. 12.]
 [21. 32.]]

x / y = 
[[0.2        0.33333333]
 [0.42857143 0.5       ]]

[[0.2        0.33333333]
 [0.42857143 0.5       ]]

sqrt(x) = 
[[1.         1.41421356]
 [1.73205081 2.        ]]
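
Because `*` multiplies element by element, it is not the matrix product from linear algebra. As a brief aside, the sketch below contrasts the two using the same values as above; the names `a` and `b` are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], dtype=np.float64)
b = np.array([[5, 6], [7, 8]], dtype=np.float64)

# Elementwise product: each entry is a[i, j] * b[i, j].
elementwise = a * b
print(elementwise)

# Matrix product: rows of a combined with columns of b.
matrix_product = a @ b
print(matrix_product)

# np.matmul is the function form of the @ operator.
print(np.array_equal(matrix_product, np.matmul(a, b)))
```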



In [None]:
x = np.array([[1, 2], [3, 4]])
print("x = \n{}\n".format(x))

# computing the sum of all elements
print("sum(x) = {}\n".format(np.sum(x)))

# computing the sum of each column; returns an array
print("sum(x), axis=0 (col) = {}\n".format(np.sum(x, axis = 0)))

# computing the sum of each row; returns an array
print("sum(x), axis=1 (row) = {}".format(np.sum(x, axis = 1)))

x = 
[[1 2]
 [3 4]]

sum(x) = 10

sum(x), axis=0 (col) = [4 6]

sum(x), axis=1 (row) = [3 7]


In [None]:
x = np.array([[1, 2], [3, 4]])
print("x = \n{}\n".format(x))

# computing the mean of all elements
print("mean(x) = {}\n".format(np.mean(x)))

# computing the mean of each column; returns an array
print("mean(x) for column = {}\n".format(np.mean(x, axis=0)))

# computing the mean of each row; returns an array
print("mean(x) for row = {}\n".format(np.mean(x, axis = 1)))

x = 
[[1 2]
 [3 4]]

mean(x) = 2.5

mean(x) for column = [2. 3.]

mean(x) for row = [1.5 3.5]
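
The `axis` argument names the axis that is collapsed by the calculation. This is easier to see with a non-square array, so the sketch below uses a hypothetical 2x3 array named `m`.

```python
import numpy as np

# A hypothetical 2x3 array, so rows and columns have different lengths.
m = np.array([[1, 2, 3], [4, 5, 6]])

col_sums = np.sum(m, axis=0)   # collapses axis 0 (rows): one value per column
row_sums = np.sum(m, axis=1)   # collapses axis 1 (columns): one value per row

print(col_sums, col_sums.shape)   # [5 7 9] (3,)
print(row_sums, row_sums.shape)   # [ 6 15] (2,)

row_means = np.mean(m, axis=1)
print(row_means)                  # [2. 5.]
```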



## Pandas
---

Numpy is useful for mathematical calculations in which everything is a number. However, we often deal with heterogeneous data including numbers, text, and time values.

Pandas is a library that provides functionality for working with the type of data that frequently arises in the real world. It supports manipulating data (e.g. transforming values and selecting subsets), summarizing data, and reading data from and writing data to files, among many other tasks.

The main data structure that Pandas works with is called a `DataFrame`. This is a two-dimensional table of data in which the rows typically represent cases or observations and the columns represent variables. Pandas also has a one-dimensional data structure called a `Series`, which is encountered, for example, when accessing a single column of a DataFrame.
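
As a minimal sketch of these two structures, the following builds a small DataFrame from a dictionary. The values and the name `people` are made up for illustration and are not the cartwheel dataset used later.

```python
import pandas as pd

# Hypothetical data: three observations (rows) of two variables (columns).
people = pd.DataFrame({"Age": [56, 26, 33], "Gender": ["F", "F", "M"]})

print(type(people).__name__)   # DataFrame
print(people.shape)            # (3, 2)

# Selecting a single column gives a one-dimensional Series.
ages = people["Age"]
print(type(ages).__name__)     # Series
print(ages.tolist())           # [56, 26, 33]
```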

Pandas has a variety of functions named `read_xxx` for reading data in different formats from sources such as files. An example of a file type is `csv` files, where "csv" stands for "comma-separated values".

A csv file is a lot like a spreadsheet, but it is stored in text form, using commas to "delimit" the values in a given row. Other important file formats include `excel`, `json`, and `sql`, just to name a few.


The `read_csv` function reads csv files and has many useful options. For example, you would use the option `sep='\t'` instead of the default `sep=','` if tabs instead of commas delimited the fields of your data file.
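
The effect of the `sep` option can be tried without a file on disk. In the sketch below, a hypothetical tab-delimited string is wrapped in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text, standing in for a file on disk.
tsv_text = "ID\tAge\tGender\n1\t56\tF\n2\t26\tF\n"

# With sep="\t", the three fields are split into three columns.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tsv_df)

# With the default sep=",", each line is read as a single column.
wrong_df = pd.read_csv(io.StringIO(tsv_text))
print(wrong_df.shape)
```

Checking the shape or column names right after reading is a quick way to catch a wrong delimiter.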

### Importing data

In [None]:
# path to the .csv file
file_name = '/content/stats_with_python_course/Cartwheeldata.csv'

# reading the .csv file and storing it as Pandas data frame
df = pd.read_csv(file_name)

# printing the object type
print(type(df))

<class 'pandas.core.frame.DataFrame'>


### Viewing data

The top 5 rows of the DataFrame can be viewed by calling the `head()` method.

The last 5 rows of the DataFrame can be viewed by calling the `tail()` method.

In [None]:
df.head()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [None]:
df.tail()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
20,21,23,M,2,Y,1,69.0,67.0,66,N,0,2
21,22,29,M,2,N,0,71.0,70.0,101,Y,1,8
22,23,25,M,2,N,0,70.0,68.0,82,Y,1,4
23,24,26,M,2,N,0,69.0,71.0,63,Y,1,5
24,25,23,F,1,Y,1,65.0,63.0,67,N,0,3


By default, the `head()` method shows the first 5 rows of the DataFrame. To see the first 10 rows, pass `10` as an argument to the `head` method.

In [None]:
df.head(10)

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8


The DataFrame is a 2-dimensional table of values, where each row is an observation in the cartwheel data, and each column is a variable describing some characteristic of the participants.

To see the column names, the `columns` attribute of the data frame can be used:

In [None]:
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

In a dataframe, each column has a single datatype, but different columns may have different datatypes. Datasets in the real world contain variables that have different types, but within a variable, all observations have the same type.

In [None]:
df.dtypes

Unnamed: 0,0
ID,int64
Age,int64
Gender,object
GenderGroup,int64
Glasses,object
GlassesGroup,int64
Height,float64
Wingspan,float64
CWDistance,int64
Complete,object
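
Column types can also be used to select columns. The sketch below uses a small made-up table named `small` and the `select_dtypes` method to keep only the numeric columns; the exact dtype reported for text columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical three-row table mixing integer, float, and text columns.
small = pd.DataFrame({
    "Age": [56, 26, 33],
    "Height": [62.0, 62.0, 66.0],
    "Gender": ["F", "F", "M"],
})

# One dtype per column.
print(small.dtypes)

# Keep only the numeric columns.
numeric = small.select_dtypes(include="number")
print(list(numeric.columns))   # ['Age', 'Height']
```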


### Slicing data frames

Like any table, the rows and columns of a Pandas DataFrame can be referred to by position. Since Python always counts from 0, the rows and columns are numbered 0, 1, 2, etc.

Pandas DataFrames also have row and column "indexes" that may be more natural to use than numeric positions in many cases.

For example, if our DataFrame contains information about people, there might be a column named "Age". Although we may know that the age column is in position 3 (the fourth column due to the zero-based indexing), it is generally preferable to access this column by its name ("Age") rather than by its position (3). One reason for this is that we may at some point manipulate the DataFrame so that the column positions change.

The default index values are simply the positions. Most datasets have informative column names, so it is uncommon to encounter a DataFrame that uses the default column indices.

The most common ways to index and select values from Pandas DataFrames are fairly straightforward, but there are also many advanced indexing techniques.

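There are two main ways to "slice" a DataFrame, both of which use square brackets rather than parentheses:

1. `.loc[]` -- select based on index values (labels)
2. `.iloc[]` -- select based on positions

A third indexer, `.ix[]`, mixed both approaches. It was deprecated in pandas 0.20 and removed in pandas 1.0, so it should not be used in new code.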
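
The difference between labels and positions is easiest to see when the row index is not the default 0, 1, 2. The sketch below uses a hypothetical table named `scores` with made-up row labels.

```python
import pandas as pd

# Hypothetical table with a labelled row index instead of the default 0, 1, 2.
scores = pd.DataFrame(
    {"Age": [56, 26, 33], "Score": [7, 8, 10]},
    index=["ann", "bob", "cat"],
)

print(scores.loc["bob", "Score"])   # by row label and column name: 8
print(scores.iloc[1, 1])            # by row and column position: 8

# Reordering the columns changes positions but not names.
reordered = scores[["Score", "Age"]]
print(reordered.loc["bob", "Score"])   # still 8
print(reordered.iloc[1, 1])            # now the Age value: 26
```

This is the reason name-based access tends to be more robust when a DataFrame is rearranged.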

### Indexing with .loc[]

The `.loc` indexer for a DataFrame takes up to two indexing values inside square brackets, separated by ','. The first indexing value selects rows and the second indexing value selects columns. An indexing value may be a single index value, a range (slice) of index values, or a list containing one or more index values.

In [None]:
# returning all observations of the variable CWDistance
df.loc[:, 'CWDistance']

Unnamed: 0,CWDistance
0,79
1,70
2,85
3,87
4,72
5,81
6,107
7,98
8,106
9,65


The following syntax is equivalent to the one used above:

In [None]:
df["CWDistance"]

Unnamed: 0,CWDistance
0,79
1,70
2,85
3,87
4,72
5,81
6,107
7,98
8,106
9,65


The following example selects all rows for multiple columns, `["CWDistance", "Height", "Wingspan"]`:

In [None]:
df.loc[:, ["CWDistance", "Height", "Wingspan"]]

Unnamed: 0,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


The syntax below is equivalent.

In [None]:
df[["CWDistance", "Height", "Wingspan"]]

Unnamed: 0,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


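The following selects a limited range of rows for multiple columns. Because `.loc` slices include the end label, `:9` returns the ten rows labelled 0 through 9.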

In [None]:
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]

Unnamed: 0,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


The following selects a limited range of rows for all columns.

In [None]:
df.loc[10:15]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
10,11,30,M,2,Y,1,69.5,66.0,96,Y,1,6
11,12,28,F,1,Y,1,62.75,58.0,79,Y,1,10
12,13,25,F,1,Y,1,65.0,64.5,92,Y,1,6
13,14,23,F,1,N,0,61.5,57.5,66,Y,1,4
14,15,31,M,2,Y,1,73.0,74.0,72,Y,1,9
15,16,26,M,2,Y,1,71.0,72.0,115,Y,1,6


The `.loc` indexer accepts a row selection and, optionally, a column selection. When both are given, as below, the result is restricted to the chosen rows and columns.

In [None]:
df.loc[:9, "CWDistance"]

Unnamed: 0,CWDistance
0,79
1,70
2,85
3,87
4,72
5,81
6,107
7,98
8,106
9,65


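### Indexing with .iloc[]

The `.iloc` indexer is used for position-based slicing. As with ordinary Python slices, the end position is excluded, so `:4` selects positions 0 through 3.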

In [None]:
# first 4 rows with all columns
df.iloc[:4]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10


In [None]:
df.iloc[1:5, 2:4]

Unnamed: 0,Gender,GenderGroup
1,F,1
2,F,1
3,F,1
4,M,2
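
With the default index, labels and positions coincide, but the two indexers still treat the end of a slice differently. A minimal sketch with a made-up single-column table named `numbers`:

```python
import pandas as pd

# Hypothetical table with the default index 0..4.
numbers = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = numbers.loc[1:3]       # labels 1, 2, 3: end label included
by_position = numbers.iloc[1:3]   # positions 1, 2: end position excluded

print(by_label["value"].tolist())      # [20, 30, 40]
print(by_position["value"].tolist())   # [20, 30]
```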


Position-based slicing of the rows and label-based selection of the columns can be combined by chaining the two operations.

In [None]:
df.iloc[:5, :][["Gender"]]

Unnamed: 0,Gender
0,F
1,F
2,F
3,F
4,M
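
Chaining is not the only option. The sketch below shows two possible single-step alternatives on a made-up table named `demo`: translating the column name to a position with `columns.get_loc`, or translating the row positions to labels with `index[:n]`.

```python
import pandas as pd

# Hypothetical stand-in for the cartwheel data.
demo = pd.DataFrame({"Age": [56, 26, 27], "Gender": ["F", "F", "M"]})

# Two chained steps, in the style of the cell above.
chained = demo.iloc[:2, :][["Gender"]]

# Option 1: translate the column name to a position, then use .iloc once.
gender_pos = demo.columns.get_loc("Gender")
via_iloc = demo.iloc[:2, [gender_pos]]

# Option 2: translate the row positions to labels, then use .loc once.
via_loc = demo.loc[demo.index[:2], ["Gender"]]

print(via_iloc.equals(chained), via_loc.equals(chained))
```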


Often, one might want to see the unique values within a specific column.

In [None]:
# listing all unique values in the df['Gender'] column
df["Gender"].unique()

array(['F', 'M'], dtype=object)

The same can be done for another variable, `GenderGroup`.

In [None]:
df["GenderGroup"].unique()

array([1, 2])
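
When the number of observations per value matters as well, the `value_counts` method can be used alongside `unique`. A minimal sketch on a made-up Series named `genders`:

```python
import pandas as pd

# Hypothetical column of coded values.
genders = pd.Series(["F", "M", "F", "F", "M"], name="Gender")

# unique() lists each distinct value once, in order of first appearance.
distinct = list(genders.unique())
print(distinct)

# value_counts() reports how many times each distinct value occurs.
counts = genders.value_counts()
print(counts)
```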

It seems the variables `Gender` and `GenderGroup` contain the same information in different coding schemes. This can be confirmed with a cross tabulation, using the `crosstab` function in Pandas.

In [None]:
pd.crosstab(df["Gender"], df["GenderGroup"])

GenderGroup,1,2
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,12,0
M,0,13


From the result above, it is clear that everyone whose Gender is `F` has a GenderGroup value of `1`, and everyone whose gender is `M` has a GenderGroup value of `2`.

The same result can be obtained by using the `groupby()` and `size` methods.

In [None]:
df.groupby(['Gender', 'GenderGroup']).size()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Gender,GenderGroup,Unnamed: 2_level_1
F,1,12
M,2,13


Again, the output indicates that there are two combinations:

* Case 1: Gender = F and GenderGroup = 1
* Case 2: Gender = M and GenderGroup = 2
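
The relationship between the two approaches can be sketched on a small made-up table named `coded`. `groupby().size()` lists only the combinations that occur, while `crosstab` also shows the absent ones as zeros; unstacking the grouped result with `fill_value=0` gives the same layout.

```python
import pandas as pd

# Hypothetical data in which Gender and GenderGroup encode the same thing.
coded = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "GenderGroup": [1, 1, 2, 2, 2],
})

table = pd.crosstab(coded["Gender"], coded["GenderGroup"])
sizes = coded.groupby(["Gender", "GenderGroup"]).size()

# Unstacking moves GenderGroup into columns; absent combinations become 0.
unstacked = sizes.unstack(fill_value=0)

print(table)
print(sizes)
print(unstacked)
```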