# Introduction to NumPy/Pandas

First, we begin by importing the libraries. Each one is typically imported under an abbreviation that makes it easier to call its functions afterwards.

1. **NumPy:** A Python library for efficient numerical computations on large datasets. These computations use special functions and operations that are optimized for NumPy arrays.


2. **Pandas:** A Python library for data analysis and manipulation. It is built on top of NumPy and uses NumPy arrays to build Series and DataFrames.

In [None]:
import pandas as pd
import numpy as np

## NumPy Arrays

These arrays are more flexible than lists. They can have a shape similar to a list, e.g. (1 x 3), or they can have more rows and take a shape similar to a matrix.
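
As a minimal sketch of the two shapes described above, `np.array` builds a one-dimensional array from a flat list and a two-dimensional array from a list of lists. The variable names `arr_1d` and `arr_2d` are illustrative and not part of the original notebook.

```python
import numpy as np

# A flat list becomes a one-dimensional array
arr_1d = np.array([1, 2, 3, 4])

# A list of equal-length lists becomes a two-dimensional array (a matrix)
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_1d.shape)  # (4,)
print(arr_2d.shape)  # (2, 3): 2 rows, 3 columns
```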

In [None]:
# Create a list
lst = [1,2,3,4]

In [None]:
# One-dimensional array


In [None]:
# Two-dimensional array


In [None]:
# Another example of 2D array


And so on.

We can also reshape our initial array into a different shape, as long as its elements distribute evenly across the new rows and columns.
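
A short sketch of what "distribute evenly" means in practice, assuming a four-element array like the list defined earlier (the names `arr` and `matrix` are illustrative): four elements fit a 2 x 2 layout, but not a 3 x 2 one.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

# 4 elements fit evenly into 2 rows x 2 columns
matrix = arr.reshape(2, 2)

print(matrix.ndim)   # 2
print(matrix.shape)  # (2, 2), reported as (rows, columns)

# 4 elements cannot fill a 3 x 2 layout, so NumPy raises a ValueError
try:
    arr.reshape(3, 2)
except ValueError as err:
    print("reshape failed:", err)
```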

In [None]:
# We can reshape to a 2x2 matrix


In [None]:
# Check dimensions


In [None]:
# Check the shape (rows, columns)


### Indexing

The syntax of indexing is the same as for lists.
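
For instance, a one-dimensional array can be indexed and sliced exactly as a list would be. This is only a sketch; the array contents and variable names are made up for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

first = arr[0]     # first element; positions start at 0
middle = arr[1:3]  # includes position 1, excludes position 3
last = arr[-1]     # negative positions count from the end

print(first)   # 10
print(middle)  # [20 30]
print(last)    # 40
```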

In [None]:
# Index first element


In [None]:
# Slicer: include first index, exclude second index


### Data Types

Arrays can hold many different types of data, such as integers, floats or strings. You can check the data type of an array with its `dtype` attribute.

In [None]:
# Datatype


Hence, our array is made up of integers. Just as we can change the data type of a variable, we can change the data type of an array using `astype`.
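
A small sketch of checking and converting a data type, assuming an integer array like the one used so far (the variable names are illustrative). Note that `astype` returns a new array rather than modifying the original.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

# dtype is an attribute, so no parentheses are needed
print(arr.dtype)  # an integer type such as int64 (platform dependent)

# astype returns a new array; the original array is left unchanged
arr_str = arr.astype(str)
arr_float = arr.astype(float)

print(arr_str)    # ['1' '2' '3' '4']
print(arr_float)  # [1. 2. 3. 4.]
```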

In [None]:
# Change data type to string


The rest of this tutorial focuses on how to use pandas. Pandas uses the tools from NumPy to analyze large datasets.

We will begin with an introduction to pandas and then go through the remainder of the lesson focused on applying functions to data from Yahoo Finance.

## Pandas

Pandas is the go-to tool for cleaning, transforming and analysing data. It allows you to create datasets, read in datasets, manipulate them, calculate summary statistics, remove missing values, etc. We will cover some of these topics in this course.

We started with NumPy because pandas is built on top of the NumPy package, so the structure and functions are very similar.

The two main components of pandas are **Series** and **DataFrames.** Many functions that apply to Series carry over to DataFrames with similar syntax.

### Series

A **Series** is a one-dimensional sequence of values that acts very similarly to a NumPy array, with an index attached. It is displayed as a single column.

You can let pandas create a default index or specify an index. There are many ways to create a Series.
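
Three common ways to build a Series are sketched below: from an array with the default index, from an array with explicit labels, and from a dictionary. The variable names are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4])

# Default integer index: 0, 1, 2, 3
s_default = pd.Series(values)

# Explicit index labels
s_labeled = pd.Series(values, index=['a', 'b', 'c', 'd'])

# From a dictionary: the keys become the index
s_from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

print(s_labeled['b'])  # 2
```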

In [None]:
# Create an array
array1 = np.array([1,2,3,4])

In [None]:
# Default integer index to create a series


In [None]:
# Set an index for series


In [None]:
# Define dictionary
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Create series


In [None]:
# Check data type


### DataFrame

A **DataFrame** on the other hand is a collection of these Series displayed as a multi-dimensional table.

#### Creating DataFrame

In [None]:
# Define dictionary
data = {'student_id': [27139,39978,35631,98632,80272,57815,37820,19711,21270,23647],
        'assignment1': [1, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87]
       }

# Create DataFrame


In [None]:
# Define Series
student_id = pd.Series([27139,39978,35631,98632,80272,57815,37820,19711,21270,23647])
a1 = pd.Series([1, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87])

# Create DataFrame


There are many other ways to create a DataFrame if you are interested. DataFrame documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
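
The two approaches set up in the cells above, building from a dictionary of lists and building from existing Series, can be sketched as follows. The data is truncated to three students for brevity, and the names `grades_from_dict` and `grades_from_series` are assumptions made for this example.

```python
import pandas as pd

# From a dictionary of lists: the keys become column names
data = {'student_id': [27139, 39978, 35631],
        'assignment1': [1.00, 0.85, 0.65]}
grades_from_dict = pd.DataFrame(data)

# From existing Series: each Series becomes one column
student_id = pd.Series([27139, 39978, 35631])
a1 = pd.Series([1.00, 0.85, 0.65])
grades_from_series = pd.DataFrame({'student_id': student_id, 'assignment1': a1})

# Both routes produce the same table
print(grades_from_dict.equals(grades_from_series))  # True
```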

#### Displaying Values

In [None]:
# Display first five rows of DataFrame


In [None]:
# Display last five rows of DataFrame


#### Working with columns

In [None]:
# What columns do we have?


In [None]:
# Set the index to be a column


#### Please note:
`inplace=True` means that the original DataFrame stored in `grades` is replaced with the new DataFrame. You can treat it as:

> grades = grades.set_index('student_id')
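
To illustrate the note, the sketch below shows that `set_index` without `inplace=True` returns a new DataFrame and leaves the original untouched, so the result has to be reassigned. The three-student `grades` table is a shortened stand-in for the one in this tutorial.

```python
import pandas as pd

grades = pd.DataFrame({'student_id': [27139, 39978, 35631],
                       'assignment1': [1.00, 0.85, 0.65]})

# Without inplace=True, set_index returns a new DataFrame
# and leaves the original unchanged
indexed = grades.set_index('student_id')
print(grades.columns.tolist())   # ['student_id', 'assignment1']
print(indexed.columns.tolist())  # ['assignment1']

# Reassigning the result has the same effect as inplace=True
grades = grades.set_index('student_id')
print(grades.index.name)  # student_id
```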

In [None]:
# Drop duplicate column


In [None]:
# View the data again


In [None]:
# Extract the assignment column


In [None]:
# Get the value as a Series


**Important reminder:**

If there are spaces in your column name, then use the `df['column name']` method. If there are no spaces, then you can also use the `df.column_name` method to extract a column.
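
A brief sketch of both access styles, using a made-up table whose `'final exam'` column is hypothetical and exists only to show a name containing a space:

```python
import pandas as pd

# Column names here are invented for the example
df = pd.DataFrame({'assignment1': [1.00, 0.85],
                   'final exam': [0.90, 0.70]})

# Attribute access works when the name has no spaces
col_attr = df.assignment1

# Bracket access works for any name, including ones with spaces
col_space = df['final exam']

# Single brackets give a Series, double brackets give a DataFrame
as_series = df['assignment1']
as_frame = df[['assignment1']]

# .values exposes the underlying data as a NumPy array
raw = df['assignment1'].values
print(type(raw).__name__)  # ndarray
```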

In [None]:
# Extract a column's values


Notice that when we extract the data values, we have a NumPy array!

Next, let's look at adding a new column. Let's say you want to add another assignment to the grades for these 10 students.

In [None]:
# Extract index


In [None]:
# Reset index


In [None]:
# Create a new Series of data
assignment2 = pd.Series([0.95, 0.80, 0.78, 0.75, 0.90, 0.85, 0.77, 0.94, 0.88, 0.90])

In [None]:
# Add Series to dataframe as a new column

# View data


In [None]:
# Renaming the columns


Now we have two assignments in our data. Note that the index of the new Series needs to match the index of the DataFrame for the column to be added correctly, which is why I set the index to match the previous students.
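
The sketch below shows why the index matters, using a shortened three-student table (the column name `unaligned` is hypothetical). When a Series is assigned as a column, pandas lines values up by index label, so labels that do not match produce missing values.

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [1.00, 0.85, 0.65]},
                      index=[27139, 39978, 35631])

# This Series has the default index 0, 1, 2, which matches no student ID
unaligned = pd.Series([0.95, 0.80, 0.78])
grades['unaligned'] = unaligned
print(grades['unaligned'].isna().all())  # True: no labels matched

# Giving the Series the same index lets pandas line the values up
aligned = pd.Series([0.95, 0.80, 0.78], index=grades.index)
grades['assignment2'] = aligned
print(grades['assignment2'].tolist())  # [0.95, 0.8, 0.78]
```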

### Transforming Data

Next, for transforming data, you can apply simple mathematical operations to your columns, e.g. you can multiply, add, subtract or divide an entire column by a value or by another column.
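
As a rough illustration with a shortened table, arithmetic between two columns is computed row by row, and arithmetic with a single value is applied to every row. The column name `assignment1_pct` is made up for this sketch.

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [1.00, 0.85, 0.65],
                       'assignment2': [0.95, 0.80, 0.78]})

# Column minus column: computed row by row
grades['diff'] = grades['assignment2'] - grades['assignment1']

# Column times a single value: applied to every row
grades['assignment1_pct'] = grades['assignment1'] * 100

print(grades.round(2))
```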

In [None]:
# Grade increase from assignment 1 to assignment 2 


In [None]:
# Multiply all assignment grades by 100 to get them in percentages


An alternative would have been to use `grades * 100`, which multiplies every column by 100.

In [None]:
# View the data


#### Selecting Data

Now for selecting data, the best methods with DataFrames are `.loc` and `.iloc`. Use `:` if you want to return all the data.

**`.loc`**
* Label-based
* Specify rows and columns by their row and column labels

**`.iloc`**
* Integer-based
* Specify rows and columns based on their integer position (starts at 0)
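
The difference between the two selectors can be sketched on a small table indexed by student ID. The three rows and their values are illustrative, not the full tutorial data.

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [100.0, 85.0, 65.0],
                       'assignment2': [95.0, 80.0, 78.0]},
                      index=[27139, 35631, 39978])

# .loc uses labels: row label 35631, column label 'assignment2'
by_label = grades.loc[35631, 'assignment2']

# .iloc uses positions: second row (1), second column (1)
by_position = grades.iloc[1, 1]
print(by_label == by_position)  # True

# ':' selects everything along that axis
all_rows_one_col = grades.loc[:, 'assignment1']
one_row_all_cols = grades.iloc[0, :]

# .loc slices include the end label; .iloc slices exclude the end position
loc_slice = grades.loc[27139:35631, 'assignment1']  # 2 rows
iloc_slice = grades.iloc[0:1, 0]                    # 1 row
```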

To understand this better, let's say you wanted to get the assignment 2 grade for student with ID number 5.

In [None]:
# Set the index


In [None]:
# Extract assignment 2 grade


In [None]:
# Equivalent iloc statement


In [None]:
# Use .loc with a column and slicing


In [None]:
# Get all rows


In [None]:
# Get all columns


Now, let's say that student 27139 was incorrectly given a 100, we want to replace their assignment 1 grade with a 98.

In [None]:
# Replace 100 with 98
grades.loc[27139, 'assignment1'] = 98

In [None]:
# Check values for student 27139
grades.loc[27139]

We can replace values by using `.loc` or `.iloc` to select a specific value and then reassigning it.

Another important concept with selecting data is the idea of **masking**. This is essentially filtering your data based on the columns.

`dataframe[mask]` 

In [None]:
# Assignment 1 less than 50
grades[grades.assignment1 < 50]

No students failed the assignment! 

In [None]:
# Filtering with AND 
# Assignment 1 between 60 and 70
grades[(grades.assignment1 < 70) & (grades.assignment1 >= 60)]

In [None]:
# A mask on its own is a boolean Series: True where the condition holds
(grades.assignment1 < 70)

In [None]:
# Filtering with OR 
# Assignment 1 below 70 or Assignment 2 below 70
grades[(grades.assignment1 < 70) | (grades.assignment2 < 70)]

In [None]:
# Filtering based on equality
grades[grades.assignment1 == 75]

In [None]:
# Using filters with loc
grades.loc[(grades.assignment1 < 70) | (grades.assignment2 < 70)]

In [None]:
# Get the diff column with loc 
grades.loc[(grades.assignment1 < 70) | (grades.assignment2 < 70), 'diff']

#### The entire grades
You can also display the entire DataFrame. However, we do not recommend this on assignments. Please display only the first 5 or last 5 rows, as will be specified in each assignment question.

In [None]:
grades

## Exercises

Exercise 1: Calculate the average grade of each student based on the two assignments, and show the average as a number between 0 and 100. Please do NOT find the average for each student one by one.

Exercise 2: Get the subset of students with an average above 85, where assignment grades are numbers between 0 and 100.

Additional exercises will be provided after the tutorial for participation marks.

In [1]:
import pandas as pd

# Given the DataFrame for grades defined as follows
data = {'student_id': [27139,39978,35631,98632,80272,57815,37820,19711,21270,23647],
        'assignment1': [1.00, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87],
        'assignment2': [0.95, 0.80, 0.78, 0.75, 0.90, 0.85, 0.77, 0.94, 0.88, 0.90],
        'assignment3': [0.90, 0.90, 0.50, 0.80, 0.60, 0.95, 0.77, 0.95, 0.99, 0.80]}
grades = pd.DataFrame(data)

In [None]:
# Exercise 1


In [None]:
# Exercise 2
