# Introduction to NumPy/Pandas

First, we import the libraries. Each is typically imported under an abbreviation that makes it easier to call its functions afterwards.

1. **NumPy:** A Python library for efficient numerical computation on large datasets. Its speed comes from functions and operations that are optimized for NumPy arrays.


2. **Pandas:** A Python library for data analysis and manipulation. It is built on top of NumPy and uses NumPy arrays to build Series and DataFrames.

In [1]:
import pandas as pd
import numpy as np

## NumPy Arrays

NumPy arrays are more flexible in shape than lists. They can be one-dimensional, like a list of values, or they can have several rows and take the shape of a matrix.
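
Beyond shape, arrays also differ from lists in how arithmetic behaves. The following is a minimal sketch of that difference; the variable names are chosen for illustration only.

```python
import numpy as np

lst = [1, 2, 3, 4]
arr = np.array(lst)

# Multiplying a list repeats it; multiplying an array scales each element
doubled_list = lst * 2       # [1, 2, 3, 4, 1, 2, 3, 4]
doubled_array = arr * 2      # array([2, 4, 6, 8])

# Arithmetic between arrays of the same shape is element-wise
summed = arr + arr           # array([2, 4, 6, 8])
```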

In [2]:
# Create a list
lst = [1,2,3,4]

In [3]:
# One-dimensional array
array1 = np.array(lst)

array1

array([1, 2, 3, 4])

In [4]:
# Two-dimensional array
array2 = np.array([lst, lst])

array2

array([[1, 2, 3, 4],
       [1, 2, 3, 4]])

In [5]:
np.array([[1,2],
         [1,2]])

array([[1, 2],
       [1, 2]])

And so on.

We can also reshape our initial array into a different shape, as long as its elements divide evenly into the new dimensions.

In [6]:
# We can reshape to a 2x2 matrix
array1.reshape((2,2))

array([[1, 2],
       [3, 4]])
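
To make the "divides evenly" condition concrete, here is a small sketch using a six-element array (an example array, not one from this lesson). It also shows `-1`, which asks NumPy to infer one dimension from the array's size.

```python
import numpy as np

arr = np.arange(1, 7)                  # array([1, 2, 3, 4, 5, 6])

two_by_three = arr.reshape((2, 3))     # 2 rows, 3 columns
three_by_two = arr.reshape((3, -1))    # -1 is inferred as 2

# 6 elements cannot fill a 4 x 2 grid, so NumPy raises a ValueError
try:
    arr.reshape((4, 2))
    reshape_failed = False
except ValueError:
    reshape_failed = True
```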

In [7]:
# Check dimensions
array2.ndim

2

In [8]:
# Check the shape: one length for a 1-D array, (rows, columns) for a 2-D array
array1.shape

(4,)
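
For comparison, the sketch below checks the shape of both arrays from this section, re-created here so the block runs on its own. For a two-dimensional array the shape reads as (rows, columns).

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([[1, 2, 3, 4],
                   [1, 2, 3, 4]])

shape_1d = array1.shape      # (4,): a single axis of length 4
shape_2d = array2.shape      # (2, 4): 2 rows, 4 columns
n_elements = array2.size     # 8 elements in total
```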

### Indexing

The syntax of indexing is the same as for lists.

In [9]:
# Index first element
array1[0]

1

In [10]:
# Slicer: include first index, exclude second index
array1[0:2]

array([1, 2])
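
For two-dimensional arrays, NumPy additionally accepts a row and a column position separated by a comma. A short sketch, using a small example array that is not part of the lesson data:

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])
grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

last = array1[-1]           # negative indices count from the end, as with lists
element = grid[1, 2]        # row 1, column 2
first_row = grid[0]         # the whole first row
second_col = grid[:, 1]     # every row, column 1
```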

### Data Types

All elements of a NumPy array share a single data type, and many different data types are available. You can check which one an array uses with the `dtype` attribute.

In [11]:
# Datatype
array1.dtype

dtype('int32')

Hence, our array is made up of integers. Just as we can change the data type of a variable, we can change the data type of an array using `astype`.

In [12]:
# Change data type to string
array1.astype('str')

array(['1', '2', '3', '4'], dtype='<U11')
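
As a further sketch, `astype` can also convert to floats. Note that it returns a new array and leaves the original untouched. The `mixed` array is an illustrative example of how NumPy picks one type for all elements.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])

# astype returns a converted copy
as_float = array1.astype(float)      # array([1., 2., 3., 4.])

# The original array still holds integers
original_kind = array1.dtype.kind    # 'i' stands for integer

# Mixing integers and floats produces a float array
mixed = np.array([1, 2.5, 3])
```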

The rest of this tutorial focuses on how to use pandas. Pandas uses the tools from NumPy to analyze large datasets.

We will begin with an introduction to pandas and then go through the remainder of the lesson focused on applying functions to data from Yahoo Finance.

## Pandas

Pandas is the go-to tool for cleaning, transforming and analysing data. It allows you to create datasets, read in datasets, manipulate them, calculate summary statistics, remove missing values, etc. We will cover some of these topics in this course.

We started with NumPy because pandas is built on top of the NumPy package, so the structure and functions are very similar.

The two main components of pandas are **Series** and **DataFrames.** Many functions that apply to Series carry over to DataFrames with similar syntax.
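
As a small illustration of a method shared by both, the sketch below calls `mean()` on a Series and on a DataFrame. The values and the column names `x` and `y` are made up for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [10, 20, 30, 40]})

# On a Series, mean() returns a single number
series_mean = s.mean()        # 2.5

# On a DataFrame, the same method returns one mean per column
column_means = df.mean()      # a Series indexed by column name
```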

### Series

A **Series** is a one-dimensional, labelled sequence of values that acts very similarly to a NumPy array. It is displayed as a single column alongside its index.

You can let pandas create a default index or specify an index. There are many ways to create a Series.

In [13]:
# Create a Series with the default integer index
pd.Series(array1)

0    1
1    2
2    3
3    4
dtype: int32

In [14]:
# Set an index for series
pd.Series(array1, index=['a', 'b', 'c', 'd'])

a    1
b    2
c    3
d    4
dtype: int32

In [15]:
# Define dictionary
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Create series
pd.Series(d)

a    1
b    2
c    3
d    4
dtype: int64

In [16]:
# Check data type
pd.Series(d).dtype

dtype('int64')

### DataFrame

A **DataFrame**, on the other hand, is a collection of Series displayed as a two-dimensional table.

#### Creating DataFrame

In [17]:
# Define dictionary
data = {'student_id': [27139,39978,35631,98632,80272,57815,37820,19711,21270,23647],
        'assignment1': [1, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87]
       }

# Create DataFrame
grades = pd.DataFrame(data)

grades.head()

,student_id,assignment1
0,27139,1.0
1,39978,0.85
2,35631,0.65
3,98632,0.8
4,80272,0.75


In [18]:
# Define each column as a Series
student_id = pd.Series([27139,39978,35631,98632,80272,57815,37820,19711,21270,23647])
a1 = pd.Series([1, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87])

# Create DataFrame
grades2 = pd.DataFrame({'student_id': student_id, 
                        'assignment1': a1})

grades2.head()

,student_id,assignment1
0,27139,1.0
1,39978,0.85
2,35631,0.65
3,98632,0.8
4,80272,0.75


There are many other ways to create a DataFrame if you are interested. DataFrame documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

#### Displaying Values

In [19]:
# Display first five rows of DataFrame
grades.head()

,student_id,assignment1
0,27139,1.0
1,39978,0.85
2,35631,0.65
3,98632,0.8
4,80272,0.75


In [20]:
# Display last five rows of DataFrame
grades.tail()

,student_id,assignment1
5,57815,0.6
6,37820,0.75
7,19711,0.92
8,21270,0.97
9,23647,0.87


#### Working with columns

In [21]:
# What columns do we have?
grades.columns

Index(['student_id', 'assignment1'], dtype='object')

In [22]:
# Set the index to the student IDs
grades.set_index(student_id, inplace=True)

grades.head()

,student_id,assignment1
27139,27139,1.0
39978,39978,0.85
35631,35631,0.65
98632,98632,0.8
80272,80272,0.75


#### Please note:
`inplace=True` means the DataFrame stored in `grades` is modified directly, rather than a new DataFrame being returned. You can treat it as:

> grades = grades.set_index(student_id)
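
The sketch below illustrates the difference on a small made-up DataFrame (the IDs and scores are hypothetical). One assumption to note: it passes the column name to `set_index`, which also moves that column into the index, whereas the cell above passes a separate Series.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'student_id': [101, 102, 103],
                   'score': [0.9, 0.8, 0.7]})

# Without inplace=True, set_index returns a new DataFrame...
reindexed = df.set_index('student_id')

# ...and the original still has its default integer index
original_index = list(df.index)      # [0, 1, 2]

# Reassigning the result has the same effect as inplace=True
df = df.set_index('student_id')
new_index = list(df.index)           # [101, 102, 103]
```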

In [23]:
# Drop duplicate column
grades.drop(columns=['student_id'], inplace=True)

In [24]:
# View the data again
grades.head()

,assignment1
27139,1.0
39978,0.85
35631,0.65
98632,0.8
80272,0.75


In [25]:
grades2.head()

,student_id,assignment1
0,27139,1.0
1,39978,0.85
2,35631,0.65
3,98632,0.8
4,80272,0.75


In [26]:
# Extract the assignment column
grades['assignment1']

27139    1.00
39978    0.85
35631    0.65
98632    0.80
80272    0.75
57815    0.60
37820    0.75
19711    0.92
21270    0.97
23647    0.87
Name: assignment1, dtype: float64

In [27]:
# Get the same column as a Series using attribute access
grades.assignment1

27139    1.00
39978    0.85
35631    0.65
98632    0.80
80272    0.75
57815    0.60
37820    0.75
19711    0.92
21270    0.97
23647    0.87
Name: assignment1, dtype: float64

**Important reminder:**

If there are spaces in your column name, then use the `df['column name']` method. If there are no spaces, you can also use the `df.column_name` method to extract a column.
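
A short sketch of why the bracket method is the safer default. The column names `final grade` and `count` are hypothetical. Besides names with spaces, a column whose name matches an existing DataFrame method (such as `count`) is also only reachable with brackets.

```python
import pandas as pd

# Hypothetical column names for illustration
df = pd.DataFrame({'final grade': [0.9, 0.8],
                   'count': [1, 2]})

# A name with a space only works with brackets
with_spaces = df['final grade']

# df.count refers to the DataFrame method, not the column
attribute_is_method = callable(df.count)

# Brackets always return the column
count_column = df['count']
```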

In [28]:
# Extract a column's values
grades.assignment1.values

array([1.  , 0.85, 0.65, 0.8 , 0.75, 0.6 , 0.75, 0.92, 0.97, 0.87])

Notice that when we extract the data values, we have a NumPy array!

Next, let's look at adding a new column. Let's say you want to add another assignment to the grades for these 10 students.

In [29]:
# Extract index
grades.index

Int64Index([27139, 39978, 35631, 98632, 80272, 57815, 37820, 19711, 21270,
            23647],
           dtype='int64')

In [30]:
# Reset index
grades.reset_index(inplace=True)

In [31]:
grades.head()

,index,assignment1
0,27139,1.0
1,39978,0.85
2,35631,0.65
3,98632,0.8
4,80272,0.75


In [32]:
# Create a new Series of data
assignment2 = pd.Series([0.95, 0.80, 0.78, 0.75, 0.90, 0.85, 0.77, 0.94, 0.88, 0.90])

assignment2

0    0.95
1    0.80
2    0.78
3    0.75
4    0.90
5    0.85
6    0.77
7    0.94
8    0.88
9    0.90
dtype: float64

In [33]:
# Add Series to dataframe as a new column
grades['assignment2'] = assignment2

# View data
grades.head()

,index,assignment1,assignment2
0,27139,1.0,0.95
1,39978,0.85,0.8
2,35631,0.65,0.78
3,98632,0.8,0.75
4,80272,0.75,0.9


In [34]:
# Renaming the columns

grades.columns = ['student_id', 'assignment1', 'assignment2']

In [35]:
grades.head()

,student_id,assignment1,assignment2
0,27139,1.0,0.95
1,39978,0.85,0.8
2,35631,0.65,0.78
3,98632,0.8,0.75
4,80272,0.75,0.9


Now we have two assignments in our data. Note that the index of the new Series needs to match the index of the DataFrame for the column to be added correctly, which is why we reset the index to the default integers before adding `assignment2`.
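
To see what happens when the indexes do not match, consider this minimal sketch. It uses the first three students only, and the column names `misaligned`, `by_position` and `by_label` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'assignment1': [1.0, 0.85, 0.65]},
                  index=[27139, 39978, 35631])

# This Series has the default index 0, 1, 2
new_scores = pd.Series([0.95, 0.80, 0.78])

# pandas aligns on index labels; none match, so every value is NaN
df['misaligned'] = new_scores

# A plain NumPy array has no index and is assigned by position
df['by_position'] = new_scores.values

# Alternatively, give the Series the same index as the DataFrame
df['by_label'] = pd.Series([0.95, 0.80, 0.78], index=df.index)
```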

### Transforming Data

Next, for transforming data, you can apply simple mathematical operations to your columns, i.e. you can multiply, add, subtract, or divide an entire column by a value or by another column.

In [36]:
# Grade increase from assignment 1 to assignment 2 
grades['diff'] = grades.assignment2 - grades.assignment1

grades.head()

,student_id,assignment1,assignment2,diff
0,27139,1.0,0.95,-0.05
1,39978,0.85,0.8,-0.05
2,35631,0.65,0.78,0.13
3,98632,0.8,0.75,-0.05
4,80272,0.75,0.9,0.15


In [37]:
# Multiply all assignment grades by 100 to get them in percentages
grades['assignment1'] *= 100
grades['assignment2'] *= 100
grades['diff'] *= 100

An alternative would be `grades * 100`, which multiplies every column by 100. Note that this would also scale `student_id`, so it is only appropriate when every column should be multiplied.
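
One way to scale several columns at once while leaving the others alone is to select them with a list of names first. A minimal sketch on the first two students, where `score_cols` is a name introduced for this example:

```python
import pandas as pd

df = pd.DataFrame({'student_id': [27139, 39978],
                   'assignment1': [1.0, 0.85],
                   'assignment2': [0.95, 0.80]})

# Only these columns are converted to percentages
score_cols = ['assignment1', 'assignment2']
df[score_cols] = df[score_cols] * 100
```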

In [38]:
# View the data
grades.head()

,student_id,assignment1,assignment2,diff
0,27139,100.0,95.0,-5.0
1,39978,85.0,80.0,-5.0
2,35631,65.0,78.0,13.0
3,98632,80.0,75.0,-5.0
4,80272,75.0,90.0,15.0


#### Selecting Data

Now for selecting data, the best methods with DataFrames are `.loc` and `.iloc`. Use `:` in place of the rows or the columns if you want to return all of them.

**`.loc`**
* Label-based
* Specify rows and columns by their row and column labels

**`.iloc`**
* Integer-based
* Specify rows and columns based on their integer position (starts at 0)

To understand this better, let's say you wanted to get the assignment 2 grade for the fifth student, whose ID number is 80272.

In [39]:
grades.set_index(student_id, inplace=True)
grades.head()

,student_id,assignment1,assignment2,diff
27139,27139,100.0,95.0,-5.0
39978,39978,85.0,80.0,-5.0
35631,35631,65.0,78.0,13.0
98632,98632,80.0,75.0,-5.0
80272,80272,75.0,90.0,15.0


In [40]:
# Extract assignment 2 grade
grades.loc[80272, 'assignment2']

90.0

In [41]:
# Equivalent iloc statement
grades.iloc[4, 2]

90.0

In [42]:
# Use .loc with a column and slicing
grades.assignment1.loc[27139:98632]

27139    100.0
39978     85.0
35631     65.0
98632     80.0
Name: assignment1, dtype: float64
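
Notice that the `.loc` slice above includes the end label 98632, unlike list slicing. The sketch below contrasts this with `.iloc`, which excludes the end position. The Series re-creates the first five assignment 1 grades so the block runs on its own.

```python
import pandas as pd

s = pd.Series([100.0, 85.0, 65.0, 80.0, 75.0],
              index=[27139, 39978, 35631, 98632, 80272])

# Label-based slice: both end labels are included
by_label = s.loc[27139:98632]     # 4 rows

# Position-based slice: the end position is excluded, as with lists
by_position = s.iloc[0:3]         # 3 rows
```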

In [43]:
# Get all rows of one column
grades.loc[:,'assignment1']

27139    100.0
39978     85.0
35631     65.0
98632     80.0
80272     75.0
57815     60.0
37820     75.0
19711     92.0
21270     97.0
23647     87.0
Name: assignment1, dtype: float64

In [44]:
# Get all columns for one row
grades.loc[27139, :]

student_id     27139.0
assignment1      100.0
assignment2       95.0
diff              -5.0
Name: 27139, dtype: float64

Now, let's say that student 27139 was incorrectly given a 100, and we want to replace their assignment 1 grade with a 98.

In [45]:
# Replace 100 with 98
grades.loc[27139, 'assignment1'] = 98

In [46]:
# Check values for student 27139
grades.loc[27139]

student_id     27139.0
assignment1       98.0
assignment2       95.0
diff              -5.0
Name: 27139, dtype: float64

We can replace values by using `.loc` or `.iloc` to select a specific value and then re-assigning it.
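
For completeness, here is a sketch of the same replacement done by position with `.iloc`, on a three-student copy of the data. The helper variable `col_pos` is introduced for this example and uses `Index.get_loc` to look up the column's position by name.

```python
import pandas as pd

df = pd.DataFrame({'assignment1': [100.0, 85.0, 65.0]},
                  index=[27139, 39978, 35631])

# Integer position of the column, looked up by name
col_pos = df.columns.get_loc('assignment1')

# Row position 0 is student 27139; replace the grade by position
df.iloc[0, col_pos] = 98
```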

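Another important concept with selecting data is the idea of **masking**. This is essentially filtering the rows of your data based on a condition on the column values. The mask is a boolean Series, and only the rows where it is `True` are kept.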

`dataframe[mask]` 

In [47]:
# Assignment 1 at or below 50
grades[grades.assignment1 <= 50]

,student_id,assignment1,assignment2,diff


No students failed the assignment! 

In [48]:
# Filtering with AND 
# Assignment 1 at least 60 and below 70
grades[(grades.assignment1 < 70) & (grades.assignment1 >= 60)]

,student_id,assignment1,assignment2,diff
35631,35631,65.0,78.0,13.0
57815,57815,60.0,85.0,25.0


In [49]:
# The mask on its own is a boolean Series
(grades.assignment1 < 70)

27139    False
39978    False
35631     True
98632    False
80272    False
57815     True
37820    False
19711    False
21270    False
23647    False
Name: assignment1, dtype: bool

In [50]:
# Filtering with OR 
# Assignment 1 below 70 or Assignment 2 below 70
grades[(grades.assignment1 < 70) | (grades.assignment2 < 70)]

,student_id,assignment1,assignment2,diff
35631,35631,65.0,78.0,13.0
57815,57815,60.0,85.0,25.0


In [51]:
# Filtering based on equality
grades[grades.assignment1 == 75]

,student_id,assignment1,assignment2,diff
80272,80272,75.0,90.0,15.0
37820,37820,75.0,77.0,2.0


In [52]:
# Using filters with loc
grades.loc[(grades.assignment1 < 70) | (grades.assignment2 < 70)]

,student_id,assignment1,assignment2,diff
35631,35631,65.0,78.0,13.0
57815,57815,60.0,85.0,25.0


In [53]:
# Get the diff column with loc 
grades.loc[(grades.assignment1 < 70) | (grades.assignment2 < 70), 'diff']

35631    13.0
57815    25.0
Name: diff, dtype: float64
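
A mask can also be stored in a variable, inverted with `~`, and counted with `sum()`. The sketch below uses four of the students from above, and the names `below_70`, `at_least_70` and `n_below` are introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({'assignment1': [98.0, 85.0, 65.0, 60.0],
                   'assignment2': [95.0, 80.0, 78.0, 85.0]},
                  index=[27139, 39978, 35631, 57815])

# Store the mask so it can be reused
below_70 = df['assignment1'] < 70

# ~ flips True and False, keeping the rows that do NOT match
at_least_70 = df[~below_70]

# True counts as 1, so sum() counts the matching rows
n_below = below_70.sum()
```

When combining conditions with `&` or `|`, each condition needs its own parentheses because these operators bind more tightly than comparisons such as `<`.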

#### The entire DataFrame
You can also display the entire DataFrame. However, we do not recommend this on assignments. Please display only the first 5 or last 5 rows, as specified in each assignment question.

In [54]:
grades

,student_id,assignment1,assignment2,diff
27139,27139,98.0,95.0,-5.0
39978,39978,85.0,80.0,-5.0
35631,35631,65.0,78.0,13.0
98632,98632,80.0,75.0,-5.0
80272,80272,75.0,90.0,15.0
57815,57815,60.0,85.0,25.0
37820,37820,75.0,77.0,2.0
19711,19711,92.0,94.0,2.0
21270,21270,97.0,88.0,-9.0
23647,23647,87.0,90.0,3.0


### Read dataset from CSV / Excel

You can also read data into a DataFrame from a file, for example a CSV file.

In [55]:
# read in daily activity csv
activity = pd.read_csv('daily_activity.csv', index_col='Date')

In [56]:
# View our data
activity.head()

Date,Walk,Swim,Running
2021-01-01,89,36,26
2021-01-02,78,39,29
2021-01-03,68,39,27
2021-01-04,93,30,26
2021-01-05,68,26,26
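
Since `daily_activity.csv` may not be on your machine, the sketch below reads the same first three rows from an in-memory string instead, using `io.StringIO` as a stand-in for the file. It also passes `parse_dates=True`, an optional addition not used in the cell above, so that the `Date` index is parsed as dates rather than text.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file
csv_text = """Date,Walk,Swim,Running
2021-01-01,89,36,26
2021-01-02,78,39,29
2021-01-03,68,39,27
"""

# index_col picks the index column; parse_dates=True parses that index as dates
activity = pd.read_csv(io.StringIO(csv_text),
                       index_col='Date',
                       parse_dates=True)

walk_values = activity['Walk'].tolist()
```

For Excel files, `pd.read_excel` plays the same role, though it relies on an additional engine package being installed.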
