<a href="https://colab.research.google.com/github/saffarizadeh/INSY4054/blob/main/Python_Basics_III.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://saffarizadeh.com/Logo.png" width="300px"/>

# *INSY 4054: Emerging Technologies*

# **Python Basics III**

Instructor: Dr. Kambiz Saffarizadeh

---

Most of the credit for this notebook goes to McIntire et al.: https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

Also Pandas contributors: https://github.com/pandas-dev/pandas/graphs/contributors

# Numpy

`numpy` is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy.

## Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

In [None]:
import numpy as np

In [None]:
a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"

In [None]:
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

In [None]:
b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b)
print(b.shape)                     # Prints "(2, 3)"

In [None]:
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

### Array indexing
Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

Create the following rank 2 array with shape (3, 4)

[[ 1  2  3  4]

 [ 5  6  7  8]
 
 [ 9 10 11 12]]

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

Use slicing to pull out the subarray consisting of the first 2 rows and columns 1 and 2; b is the following array of shape (2, 2):

[[2 3]

 [6 7]]

In [None]:
b = a[:2, 1:3]

A slice of an array is a view into the same data, so modifying it will modify the original array.

In [None]:
print(a[0, 1])   # Prints "2"

In [None]:
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]

In [None]:
print(a[0, 1])   # Prints "77"
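To make the view behavior concrete, here is a minimal sketch that contrasts a slice with an explicit `.copy()`. The array `base` and the names `view` and `independent` are illustrative only; `np.shares_memory` is used just to confirm which result still points at the original data.

```python
import numpy as np

base = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = base[:2, 1:3]                 # slicing returns a view onto base's data
independent = base[:2, 1:3].copy()   # .copy() allocates separate data

view[0, 0] = 77          # writes through to base[0, 1]
independent[0, 1] = -1   # does not touch base

print(base[0, 1])                             # 77
print(base[0, 2])                             # 3
print(np.shares_memory(base, view))           # True
print(np.shares_memory(base, independent))    # False
```

If you want to modify a subarray without affecting the original, calling `.copy()` on the slice is one straightforward option.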

You can select a single row or column of an array too:

In [None]:
row_r1 = a[1, :]    # Rank 1 view of the second row of a
print(row_r1)

In [None]:
col_r1 = a[:, 2]    # Rank 1 view of the third column of a
print(col_r1)
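One detail worth checking is how the index style affects the rank of the result. The sketch below uses an illustrative array named `grid` and suggests that an integer index drops a dimension while a slice keeps it.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

row_rank1 = grid[1, :]     # integer index: the row dimension is dropped
row_rank2 = grid[1:2, :]   # slice: the row dimension is kept

print(row_rank1, row_rank1.shape)   # [5 6 7 8] (4,)
print(row_rank2, row_rank2.shape)   # [[5 6 7 8]] (1, 4)

col_rank1 = grid[:, 2]     # rank 1 result
col_rank2 = grid[:, 2:3]   # rank 2 result with a single column

print(col_rank1.shape)   # (3,)
print(col_rank2.shape)   # (3, 1)
```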

Boolean array indexing: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])
print(a)

In [None]:
bool_idx = (a > 2)   # Find the elements of a that are bigger than 2;
                     # this returns a numpy array of Booleans of the same
                     # shape as a, where each slot of bool_idx tells
                     # whether that element of a is > 2.
print(bool_idx)

We use boolean array indexing to construct a rank 1 array consisting of the elements of `a` corresponding to the `True` values of `bool_idx`:

In [None]:
print(a[bool_idx])

We can do all of the above in a single concise statement:

In [None]:
print(a[a > 2])     # Prints "[3 4 5 6]"
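Boolean masks can also be combined and used for assignment. The following is a small sketch with an illustrative array named `values`; the conditions themselves are arbitrary.

```python
import numpy as np

values = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions element-wise with & (and) or | (or).
# Each condition needs its own parentheses.
mask = (values > 2) & (values % 2 == 0)
selected = values[mask]
print(selected)    # [4 6]

# A mask on the left-hand side assigns to the matching elements in place.
values[values > 4] = 0
print(values)
```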

## Numpy Datatypes

https://numpy.org/doc/stable/reference/arrays.dtypes.html

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [None]:
x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64" on most platforms

x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints "float64"

x = np.array([1, 2], dtype=np.int64)   # Force a particular datatype
print(x.dtype)
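As a further sketch of how datatypes behave, the example below shows numpy choosing a float type for mixed input and `astype` converting to another type. The variable names are illustrative.

```python
import numpy as np

mixed = np.array([1, 2.5])    # an int and a float are stored as floats
print(mixed.dtype)            # float64

small = np.array([1, 2], dtype=np.float32)   # request a smaller float type
print(small.dtype)            # float32

truncated = mixed.astype(np.int64)   # astype returns a converted copy
print(truncated)              # [1 2]
print(mixed)                  # [1.  2.5]
```

Note that converting floats to integers with `astype` drops the fractional part rather than rounding.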

We will learn more about Numpy in our Machine Learning exercises.

# Pandas

In [None]:
import numpy as np
import pandas as pd

## Core components of pandas: Series and DataFrames

The two primary components of pandas are the Series and the DataFrame.

A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.

<img src="https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png" width=50%>

### Creating columns (Series) from scratch

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Creating a range of dates (a `DatetimeIndex`):

In [None]:
dates = pd.date_range('20210101', periods=6)
dates

Creating a categorical column:

In [None]:
test_train = pd.Categorical(["test", "train", "test", "train", "train", "train"])

### Creating DataFrames from Columns

In [None]:
table_dict = {'Date': dates, 'Type_of_Learning': test_train, 'Value': s}

In [None]:
df1 = pd.DataFrame(table_dict)
df1
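To check the idea that a DataFrame is a collection of Series, the sketch below builds a smaller table in the same way and inspects one column. The name `frame` and the shortened data are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Date": pd.date_range("20210101", periods=3),
    "Type_of_Learning": pd.Categorical(["test", "train", "train"]),
    "Value": pd.Series([1, np.nan, 6]),
})

value_column = frame["Value"]
print(type(value_column))   # each column is a pandas Series
print(frame.dtypes)         # every column carries its own dtype
```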

### Creating DataFrames from scratch

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

In [None]:
data = {
        'apples': [3, 2, 0, 1], 
        'oranges': [0, 3, 7, 2]
        }

In [None]:
purchases = pd.DataFrame(data)

In [None]:
purchases

How did that work?

Each (key, value) item in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [None]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

So now we could locate a customer's order by using their name:

In [None]:
purchases.loc['June']
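`.loc` also accepts a list of labels, or a row label together with a column label. A minimal sketch, rebuilding the same table under the illustrative name `orders`:

```python
import pandas as pd

orders = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

one_row = orders.loc["June"]               # a single label returns a Series
two_rows = orders.loc[["June", "Lily"]]    # a list of labels returns a DataFrame
one_cell = orders.loc["Robert", "oranges"] # row label plus column label

print(one_row)
print(two_rows)
print(one_cell)    # 3
```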

## How to read in data

### Reading data from CSVs

In [None]:
df2 = pd.read_csv('http://saffarizadeh.com/ET/purchases.csv')

In [None]:
df2

CSV files have no built-in notion of an index, so the row labels were read as an ordinary first column. To use that column as the index, pass `index_col` when reading:

In [None]:
df2 = pd.read_csv('https://saffarizadeh.com/ET/purchases.csv', index_col=0)

df2
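The effect of `index_col` can be reproduced without downloading anything. The sketch below reads a small, made-up CSV string (not the contents of `purchases.csv`) through `io.StringIO`, once with and once without `index_col`.

```python
import io
import pandas as pd

# Hypothetical CSV text; the first column holds the row labels.
csv_text = """,apples,oranges
June,3,0
Robert,2,3
"""

default_index = pd.read_csv(io.StringIO(csv_text))
label_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())   # ['Unnamed: 0', 'apples', 'oranges']
print(label_index.index.tolist())       # ['June', 'Robert']
```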

### Reading data from JSON

If you have a JSON file, which is essentially a stored Python `dict`, pandas can read it just as easily:

In [None]:
df3 = pd.read_json('https://saffarizadeh.com/ET/purchases.json')

In [None]:
df3

Notice that this time the index was read correctly: the JSON file nests each column's values under their row labels, so pandas can recover the index. Feel free to open `purchases.json` in a text editor to see how it works.
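As a rough illustration of that nesting, the sketch below reads a made-up JSON string (an assumption about the general shape, not the actual contents of `purchases.json`) in which each column maps row labels to values.

```python
import io
import pandas as pd

# Hypothetical JSON: column name -> {row label -> value}.
json_text = '{"apples": {"Robert": 2, "Lily": 0}, "oranges": {"Robert": 3, "Lily": 7}}'

nested = pd.read_json(io.StringIO(json_text))

print(nested)
print(nested.loc["Robert", "apples"])   # 2
```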

### Converting back to a CSV or JSON

So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice. Similar to the ways we read in data, pandas provides intuitive commands to save it:

In [None]:
# One .loc call avoids chained assignment, which may not modify df2
df2.loc[df2.index[0], 'apples'] = 999
df2.to_csv('new_purchases.csv')

In [None]:
df3.to_json('new_purchases.json')
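A quick way to confirm that nothing is lost when saving is to write the data out and read it back. The sketch below does this in memory with an illustrative table named `orders`; when no path is given, `to_csv` returns the CSV text as a string.

```python
import io
import pandas as pd

orders = pd.DataFrame(
    {"apples": [3, 2], "oranges": [0, 3]},
    index=["June", "Robert"],
)

csv_text = orders.to_csv()    # no path given, so the CSV comes back as a string
print(csv_text)

# Read the text back, using the first column as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```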

### Reading data from Excel

In [None]:
excel_file_address = 'https://saffarizadeh.com/ET/Students.xlsx'

In [None]:
students_sheet1 = pd.read_excel(excel_file_address, sheet_name=0, index_col=0)

## Exploring the DataFrame

### `head` and `tail`

`.head()` outputs the **first** five rows of your DataFrame by default, but we can also pass a number: `students_sheet1.head(10)` would output the top ten rows, for example.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [None]:
students_sheet1.head(2)

To see the **last** five rows, use `.tail()`. `.tail()` also accepts a number; in this case we are printing the bottom two rows.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html

In [None]:
students_sheet1.tail(2)

In [None]:
students_sheet1.shape

### `info` and `describe`

`.info()` should be one of the very first commands you run after loading your data.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

In [None]:
students_sheet1.info()

`.describe()` shows a quick statistical summary of your data. Called on an entire DataFrame, it summarizes the distribution of the numeric columns by default. `.describe()` can also be used on a categorical variable to get the row count, the number of unique categories, the top category, and the frequency of the top category.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [None]:
students_sheet1.describe()
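The difference between the numeric and categorical summaries can be sketched on a small, made-up table. The column names `Section` and `Quiz 1` and their values are hypothetical and not taken from `Students.xlsx`.

```python
import pandas as pd

scores = pd.DataFrame({
    "Section": ["A", "B", "A", "A"],
    "Quiz 1": [8.0, 6.5, 9.0, 7.5],
})

numeric_summary = scores.describe()               # numeric columns only
section_summary = scores["Section"].describe()    # count, unique, top, freq

print(numeric_summary)
print(section_summary)
print(scores.describe(include="all"))   # both kinds; inapplicable cells are NaN
```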

### `index` and `columns`

In [None]:
students_sheet1.index

In [None]:
students_sheet1.columns

## Slicing Rows

In [None]:
students_sheet1[0:2]

### `loc` and `iloc`

- `.loc` - **loc**ates by row name, which may or may not be a number (Selection by Label)
- `.iloc` - **loc**ates by **i**nteger position (Selection by Position)

One important distinction between using `.loc` and `.iloc` to select multiple rows is that:
- `.loc` is both ends inclusive
- `.iloc` is inclusive start, exclusive end (similar to Python lists and numpy arrays)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

In [None]:
students_sheet1.loc[1:2]

In [None]:
students_sheet1.iloc[0:2]
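The inclusive versus exclusive endpoints are easiest to see when the same numbers are passed to both accessors. A minimal sketch on a hypothetical table named `roster`, whose labels start at 1:

```python
import pandas as pd

# Hypothetical roster: index labels start at 1, positions start at 0.
roster = pd.DataFrame({"Quiz 1": [8, 6, 9, 7]}, index=[1, 2, 3, 4])

by_label = roster.loc[1:2]       # labels 1 through 2, end label included
by_position = roster.iloc[1:2]   # position 1 only, end position excluded

print(by_label.index.tolist())      # [1, 2]
print(by_position.index.tolist())   # [2]
```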

## Slicing Columns

We can select a column by passing its name in `[]`.

In [None]:
students_sheet1["Quiz 1"]

This will return a *Series*. To extract a column as a *DataFrame*, you need to pass a list of column names. In our case that's a list of just a single column.

In [None]:
students_sheet1[["Quiz 1"]]

You can also pass any other list of column names.

In [None]:
students_sheet1[students_sheet1.columns[1:4]] # How does this work?
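One way to answer that question is to look at the intermediate value: `.columns` is an Index of names, slicing it by position yields a shorter Index, and passing that to `[]` selects those columns. The sketch below uses a hypothetical one-row table whose column names are assumed for illustration.

```python
import pandas as pd

# Hypothetical table; the column names are assumptions for illustration.
roster = pd.DataFrame({
    "First Name": ["Ann"],
    "Last Name": ["Lee"],
    "Quiz 1": [8],
    "Quiz 2": [9],
    "Quiz 3": [7],
})

selected_names = roster.columns[1:4]   # positions 1, 2, and 3 of the column Index
subset = roster[selected_names]        # same as passing a list of those names

print(list(selected_names))   # ['Last Name', 'Quiz 1', 'Quiz 2']
print(subset)
```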

### `loc` and `iloc`

In [None]:
students_sheet1.loc[:, "Last Name": "Quiz 2"]

In [None]:
students_sheet1.iloc[:, 1:4]

**Find Column Names that Contain a Specific Keyword**

In [None]:
quiz_columns = [column for column in students_sheet1.columns if "Quiz" in column]

In [None]:
students_sheet1[quiz_columns]
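As an alternative to the list comprehension, `DataFrame.filter` with the `like` argument keeps the columns whose names contain a given substring. A short sketch on a hypothetical table (the column names are assumed):

```python
import pandas as pd

# Hypothetical table; the column names are assumptions for illustration.
grades = pd.DataFrame({
    "Last Name": ["Lee", "Cruz"],
    "Quiz 1": [8, 6],
    "Quiz 2": [9, 7],
    "Exam": [85, 78],
})

quiz_only = grades.filter(like="Quiz")   # keep columns whose name contains "Quiz"
print(quiz_only.columns.tolist())        # ['Quiz 1', 'Quiz 2']
```

Both approaches give the same columns here; the list comprehension is more flexible when the condition is more complex than a substring match.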

### Access a single value for a row/column pair

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html

Using the methods we've already learned:

In [None]:
students_sheet1["Quiz 1"][1]

In [None]:
students_sheet1["Quiz 1"].iloc[0]

In [None]:
students_sheet1.loc[1, "Quiz 1"]

In [None]:
students_sheet1.iloc[0, 2]

Using `at` and `iat` (preferred method):

In [None]:
students_sheet1.at[1, "Quiz 1"]

In [None]:
students_sheet1.iat[0, 2]
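`at` and `iat` can also be used to set a single value. The sketch below uses a hypothetical two-row table named `roster` to show reading by label, reading by position, and writing one cell.

```python
import pandas as pd

# Hypothetical table; labels start at 1, positions start at 0.
roster = pd.DataFrame(
    {"Last Name": ["Lee", "Cruz"], "Quiz 1": [8, 6]},
    index=[1, 2],
)

by_label = roster.at[1, "Quiz 1"]   # row label 1, column label "Quiz 1"
by_position = roster.iat[0, 1]      # first row, second column
print(by_label, by_position)        # 8 8

roster.at[2, "Quiz 1"] = 10         # assign a single cell by label
print(roster.loc[2, "Quiz 1"])      # 10
```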